Course presentation: Data Mining

Professor: Karim KILANI

Student: Aleksandra VUCIC


CLUSTER ANALYSIS

  • Required packages
knitr::opts_chunk$set(echo = TRUE)
#install.packages("dplyr","ade4","magrittr","cluster","texreg",
#"factoextra","cluster.datasets","xtable","kableExtra","ggthemes","Kendall","kendallRandomWalks","VGAM","stargazer",
#"knitr","summarytools", "DT")
library(dplyr)
library(ade4)
library(magrittr)
library(cluster)
library(factoextra)
library(cluster.datasets)
library(xtable)
library(kableExtra)
library(ggthemes)
library(Kendall)
library(kendallRandomWalks)
library(VGAM)
library(knitr)
library(summarytools)
library(DT)

Here is an example of a LaTeX equation in markdown: \[ \Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-t^2/2}\,dt. \]

Definition of a distance

  • A distance function or a metric on \(\mathbb{R}^m,\:m\geq 1\), is a function \(d:\mathbb{R}^m\times\mathbb{R}^m\rightarrow \mathbb{R}\).

  • A distance function must satisfy some required properties or axioms.

  • There are three main axioms.

  • A1. \(d(\mathbf{x},\mathbf{y})= 0\iff \mathbf{x}=\mathbf{y}\) (identity of indiscernibles);

  • A2. \(d(\mathbf{x},\mathbf{y})= d(\mathbf{y},\mathbf{x})\) (symmetry);

  • A3. \(d(\mathbf{x},\mathbf{z})\leq d(\mathbf{x},\mathbf{y})+d(\mathbf{y},\mathbf{z})\) (triangle inequality), where \(\mathbf{x}=(x_1,\cdots,x_m)\), \(\mathbf{y}=(y_1,\cdots,y_m)\) and \(\mathbf{z}=(z_1,\cdots,z_m)\) are all vectors of \(\mathbb{R}^m\).

  • We should use the term dissimilarity rather than distance when not all three axioms A1-A3 hold.

  • Most of the time, we shall use, with some abuse of vocabulary, the term distance.

Exercise 1

  • Prove that the three axioms A1-A3 imply the non-negativity condition: \[d(\mathbf{x},\mathbf{y})\geq 0.\]
  • By A3: \(d(\mathbf{x},\mathbf{z})\leq d(\mathbf{x},\mathbf{y})+d(\mathbf{y},\mathbf{z})\).
  • Setting \(\mathbf{z}=\mathbf{x}\):
  • \(d(\mathbf{x},\mathbf{x})\leq d(\mathbf{x},\mathbf{y})+d(\mathbf{y},\mathbf{x})\).
  • By A1 the left-hand side is \(0\), and by A2 the right-hand side is \(2d(\mathbf{x},\mathbf{y})\), so \(0\leq 2d(\mathbf{x},\mathbf{y})\),
  • hence \(0\leq d(\mathbf{x},\mathbf{y})\).

Euclidean distance

  • It is defined by:

\[ d(\mathbf{x},\mathbf{y})=\sqrt{\sum_{j=1}^m (x_j-y_j)^2}. \]

  • A1-A2 are obvious.
  • The proof of A3 is provided below.

Exercise 2

  • Is the squared Euclidean distance a true distance?

The squared Euclidean distance does not define a metric, as it does not satisfy the triangle inequality. However, it is a smooth, strictly convex function of the two points, unlike the Euclidean distance, which is non-smooth (near pairs of equal points) and convex but not strictly convex. The squared distance is thus preferred in optimization theory, since it allows convex analysis to be used. Since squaring is a monotonic function of non-negative values, minimizing the squared distance is equivalent to minimizing the Euclidean distance, so the two optimization problems are equivalent, but the one using the squared distance is easier to solve.

The collection of all squared distances between pairs of points from a finite set may be stored in a Euclidean distance matrix, and is used in this form in distance geometry.
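As a quick numerical check (a minimal sketch; the three collinear points are chosen only for illustration), the triangle inequality indeed fails for the squared Euclidean distance:

```r
# Three collinear points on the real line: the squared Euclidean distance
# between the endpoints exceeds the sum over the midpoint, so A3 fails.
d2 <- function(a, b) sum((a - b)^2)  # squared Euclidean distance
x <- 0; y <- 2; z <- 4
d2(x, z)                         # 16
d2(x, y) + d2(y, z)              # 4 + 4 = 8
d2(x, z) <= d2(x, y) + d2(y, z)  # FALSE: triangle inequality violated
```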

Here is how it looks in R:

df <- data.frame(x=c(0,0), y=c(6,6))
ta <- t(df)

dist(ta, method = "euclidean", diag = TRUE, upper = TRUE)
##          x        y
## x 0.000000 8.485281
## y 8.485281 0.000000

Manhattan distance

  • The Manhattan distance also called taxi-cab metric or city-block metric is defined by:

\[d(\mathbf{x},\mathbf{y}) =\sum_{j=1}^m |x_j-y_j|.\]

  • A1-A2 hold.
  • A3 also holds using the fact that \(|a+b|\leq |a|+|b|\) for any reals \(a,b\).
  • There exists also a weighted version of the Manhattan distance called the Canberra distance.

  • Here is how it looks in R, comparing the Euclidean and Manhattan distances:
x = c(0, 0)
y = c(6,6)
dist(rbind(x, y), method = "euclidean", diag = TRUE, upper = TRUE)
##          x        y
## x 0.000000 8.485281
## y 8.485281 0.000000
6*sqrt(2)
## [1] 8.485281
dist(rbind(x, y), method = "manhattan", diag = TRUE, upper = TRUE)
##    x  y
## x  0 12
## y 12  0

Canberra distance

The Canberra distance eliminates scaling effects: if all the data of a component \(j\) are multiplied by a factor \(\lambda>0\), the distance remains invariant.

  • It is defined by:

\[d(\mathbf{x},\mathbf{y}) =\sum_{j=1}^m \frac{|x_j-y_j|}{|x_j|+|y_j|}.\]

  • Note that the term \(|x_j-y_j|/(|x_j|+|y_j|)\) is not properly defined when \(x_j=y_j=0\).
  • By convention, we set that term to zero in that case.
  • The Canberra distance is especially sensitive to small changes near zero.
x = c(0, 0)
y = c(6,6)
dist(rbind(x, y), method = "canberra")
##   x
## y 2
# 6/6+6/6
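A minimal sketch of that sensitivity (the coordinate values are chosen only for illustration): the same absolute change of \(0.01\) yields very different Canberra distances near zero and far from zero.

```r
# The same absolute change (0.01) produces very different Canberra distances
# depending on whether the coordinates are near zero or far from it.
dist(rbind(0.01, 0.02), method = "canberra")      # 0.01/0.03 = 1/3
dist(rbind(100.01, 100.02), method = "canberra")  # 0.01/200.03, about 5e-05
```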

Exercise 3

  • Prove that the Canberra distance is a true distance, i.e. that it satisfies A1-A3.
# Example of computing the Canberra distance with R
data.frame(x=c(0,0), y=c(6,6)) %>% 
t()%>% 
dist(diag = TRUE, upper = TRUE, method = "canberra") %>% 
as.matrix()%>%
datatable(caption="Canberra distance")

Minkowski distance

  • Both the Euclidean and the Manhattan distances are special cases of the Minkowski distance which is defined, for \(p\geq 1\), by: \[ d(\mathbf{x},\mathbf{y})= \left[\sum_{j=1}^m |x_j-y_j|^{p}\right]^{1/p}. \]
  • For \(p=1\), we get the Manhattan distance.
  • For \(p=2\), we get the Euclidean distance.
  • Let us also define: \[\|\mathbf{x}\|_p\equiv\left[\sum_{j=1}^m |x_j|^{p}\right]^{1/p},\] where \(\|\mathbf{\cdot}\|_p\) is known as the \(p\)-norm or Minkowski norm.
  • Note that the Minkowski distance and norm are related by:

\[d(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|_p.\]

  • Conversely, we have:

\[\|\mathbf{x}\|_p=d(\mathbf{x},\mathbf{0}),\]

where \(\mathbf{0}\) is the null vector of \(\mathbb{R}^m\).

library("ggplot2")
x = c(0, 0)
y = c(6,6)
MinkowDist=c() # Initialize the list as empty
for (p in seq(1,30,.01))
{
MinkowDist=c(MinkowDist,dist(rbind(x, y), method = "minkowski", p = p))     
}

ggplot(data =data.frame(x = seq(1,30,.01), y=MinkowDist ) , mapping = aes( x=x, y= y))+
  geom_point(size=.1,color="red")+
  xlab("p")+ylab("Minkowski Distance")+ggtitle("Minkowski distance wrt p")

Exercise 4

Produce a similar graph using “The Economist” theme. Indicate on the graph the Manhattan and the Euclidean distances, as well as the Chebyshev distance introduced below.

library("ggplot2")
library("ggthemes")

x = c(0, 0) 
y = c(6,6) 
MinkowDist=c() 

for (p in seq(1,30,.01)) 
{
  MinkowDist=c(MinkowDist,dist(rbind(x, y), method = "minkowski", p = p)) 
}

ggplot(data =data.frame(x = seq(1,30,.01), y=MinkowDist ), mapping = aes(x = x, y = y))+
geom_point()+
xlim(1,11)+
xlab("p")+
ylab("Minkowski Distance")+
ggtitle("Minkowski distance wrt p") +
theme_economist(base_family = "sans", horizontal = TRUE,
  dkpanel = FALSE) +
scale_colour_economist()
## Warning: Removed 1900 rows containing missing values (geom_point).

library(ggplot2)
# Minkowski distance between (0,0) and (6,6) as a function of p
minkowski <- function(p) {
  res <- as.matrix(data.frame(x=c(0,0), y=c(6,6))) %>%
    t() %>%
    dist(method = "minkowski", p = p, diag = TRUE, upper = TRUE)
  as.matrix(res)[1, 2]
}
res = c()
for (p in seq(1, 5, 0.01)) {
  res = c(res, minkowski(p))
}
df = data.frame(x = seq(1, 5, 0.01), y = res)
ggplot(df, aes(x = x, y = y)) +
  geom_line() +
  xlab("p") + ylab("Minkowski Distance")

Chebyshev distance

  • At the limit, we get the Chebyshev distance which is defined by: \[ d(\mathbf{x},\mathbf{y})=\max_{j=1,\cdots,m}(|x_j-y_j|)=\lim_{p\rightarrow\infty} \left[\sum_{j=1}^m |x_j-y_j|^{p}\right]^{1/p}. \]

  • The corresponding norm is:

\[ \|\mathbf{x}\|_\infty=\max_{j=1,\cdots,m}(|x_j|). \]
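In R, the Chebyshev distance is available in dist() as method = "maximum"; here is a quick check against the Minkowski limit (the value p = 100 is an arbitrary large choice):

```r
# Chebyshev distance via dist(method = "maximum"), compared with the
# Minkowski distance for a large p, which should approach the same value.
x = c(0, 0)
y = c(6, 6)
dist(rbind(x, y), method = "maximum")             # max(|6-0|, |6-0|) = 6
dist(rbind(x, y), method = "minkowski", p = 100)  # already close to 6
```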

Minkowski inequality

  • The proof of the triangle inequality A3 is based on the Minkowski inequality:

  • For any nonnegative real numbers \(a_1,\cdots,a_m\); \(b_1,\cdots,b_m\), and for any \(p\geq 1\), we have: \[ \left[\sum_{j=1}^m (a_j+b_j)^{p}\right]^{1/p}\leq \left[\sum_{j=1}^m a_j^{p}\right]^{1/p} +\left[\sum_{j=1}^m b_j^{p}\right]^{1/p}. \]

  • To prove that the Minkowski distance satisfies A3, notice that \[ \sum_{j=1}^m|x_j-z_j|^{p}= \sum_{j=1}^m|(x_j-y_j)+(y_j-z_j)|^{p}. \]

  • Since for any reals \(x,y\), we have: \(|x+y|\leq |x|+|y|\), and using the fact that \(x^p\) is increasing in \(x\geq 0\), we obtain: \[ \sum_{j=1}^m|x_j-z_j|^{p}\leq \sum_{j=1}^m(|x_j-y_j|+|y_j-z_j|)^{p}. \]

  • Applying the Minkowski inequality with \(a_j=|x_j-y_j|\) and \(b_j=|y_j-z_j|\), \(j=1,\cdots,m\), we get: \[ \left[\sum_{j=1}^m|x_j-z_j|^{p}\right]^{1/p}\leq \left[\sum_{j=1}^m |x_j-y_j|^{p}\right]^{1/p}+\left[\sum_{j=1}^m |y_j-z_j|^{p}\right]^{1/p}, \] which is exactly A3 for the Minkowski distance.
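The inequality can also be checked numerically (a minimal sketch on random nonnegative vectors; the seed and the values of p are arbitrary):

```r
# Numerical check of the Minkowski inequality on random nonnegative vectors:
# ||a + b||_p must not exceed ||a||_p + ||b||_p for every p >= 1.
set.seed(1)
a <- runif(10)
b <- runif(10)
norm_p <- function(v, p) sum(abs(v)^p)^(1/p)  # Minkowski p-norm
for (p in c(1, 1.5, 2, 5)) {
  cat("p =", p, ":", norm_p(a + b, p) <= norm_p(a, p) + norm_p(b, p), "\n")
}
```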

Exercise 5

To illustrate the Minkowski inequality, draw \(100\) times two lists of \(100\) draws from the lognormal distribution with mean \(1600\) and standard-deviation \(300\). Illustrate with a graph the gap between the two drawn lists.
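One possible sketch for this exercise (assumptions: we take p = 2, and we convert the stated mean and standard deviation into the meanlog/sdlog parametrization expected by rlnorm()):

```r
# Sketch for Exercise 5 (assumption: p = 2). The lognormal mean 1600 and
# sd 300 are converted to the meanlog/sdlog parameters of rlnorm().
set.seed(42)
m <- 1600; s <- 300
sdlog   <- sqrt(log(1 + (s/m)^2))
meanlog <- log(m) - sdlog^2/2
gap <- replicate(100, {
  a <- rlnorm(100, meanlog, sdlog)
  b <- rlnorm(100, meanlog, sdlog)
  (sqrt(sum(a^2)) + sqrt(sum(b^2))) - sqrt(sum((a + b)^2))  # RHS - LHS >= 0
})
plot(gap, type = "h", xlab = "draw", ylab = "gap",
     main = "Minkowski inequality: RHS minus LHS (p = 2)")
```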

Hölder inequality

  • The proof of the Minkowski inequality itself requires the Hölder inequality:
  • For any nonnegative real numbers \(a_1,\cdots,a_m\); \(b_1,\cdots,b_m\), and any \(p,q>1\) with \(1/p+1/q=1\), we have: \[ \sum_{j=1}^m a_jb_j\leq \left[\sum_{j=1}^m a_j^{p}\right]^{1/p} \left[\sum_{j=1}^m b_j^{q}\right]^{1/q} \]
  • The proof of the Hölder inequality relies on the Young inequality:
  • For any \(a,b>0\), we have \[ ab\leq \frac{a^p}{p}+\frac{b^q}{q}, \] with equality occurring iff \(a^p=b^q\).
  • To prove the Young inequality, one can use the (strict) convexity of the exponential function.
  • For any reals \(x,y\), we have: \[ e^{\frac{x}{p}+\frac{y}{q} }\leq \frac{e^{x}}{p}+\frac{e^{y}}{q}. \]
  • We then set: \(x=p\ln a\) and \(y=q\ln b\) to get the Young inequality.
  • A good reference on inequalities is: Z. Cvetkovski, Inequalities: theorems, techniques and selected problems, 2012, Springer Science & Business Media.
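A quick numerical check of the Young inequality (the values of a, b and p are arbitrary):

```r
# Numerical check of the Young inequality ab <= a^p/p + b^q/q with 1/p + 1/q = 1.
p <- 3; q <- p / (p - 1)
a <- 2; b <- 5
a * b <= a^p / p + b^q / q           # TRUE
# Equality case: choose b so that b^q = a^p, i.e. b = a^(p-1)
b <- a^(p - 1)
all.equal(a * b, a^p / p + b^q / q)  # TRUE
```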

Cauchy-Schwarz inequality

  • Note that the triangle inequality for the Minkowski distance implies: \[ \left[\sum_{j=1}^m |x_j|^{p}\right]^{1/p}\leq \sum_{j=1}^m |x_j|. \]
  • Note that for \(p=2\), we have \(q=2\). The Hölder inequality implies for that special case \[ \sum_{j=1}^m|x_jy_j|\leq\sqrt{\sum_{j=1}^m x_j^2}\sqrt{\sum_{j=1}^m y_j^2}. \]
  • Since the LHS of the above inequality is greater than \(|\sum_{j=1}^mx_jy_j|\), we get the Cauchy-Schwarz inequality \[ \left|\sum_{j=1}^mx_jy_j\right|\leq\sqrt{\sum_{j=1}^m x_j^2}\sqrt{\sum_{j=1}^m y_j^2}. \]
  • Using the dot product notation, also called scalar product notation, \(\mathbf{x\cdot y}=\sum_{j=1}^mx_jy_j\), and the norm notation \(\|\mathbf{\cdot}\|_2\), the Cauchy-Schwarz inequality reads: \[ |\mathbf{x\cdot y} | \leq \|\mathbf{x}\|_2 \| \mathbf{y}\|_2. \]
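A quick numerical check of the Cauchy-Schwarz inequality (random vectors; the seed is arbitrary):

```r
# Numerical check of the Cauchy-Schwarz inequality |x . y| <= ||x||_2 ||y||_2.
set.seed(7)
x <- rnorm(20)
y <- rnorm(20)
abs(sum(x * y)) <= sqrt(sum(x^2)) * sqrt(sum(y^2))  # TRUE
# Equality holds when the vectors are proportional:
all.equal(abs(sum(x * (2 * x))), sqrt(sum(x^2)) * sqrt(sum((2 * x)^2)))  # TRUE
```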

Pearson correlation distance

  • The Pearson correlation coefficient is a similarity measure on \(\mathbb{R}^m\) defined by: \[ \rho(\mathbf{x},\mathbf{y})= \frac{\sum_{j=1}^m (x_j-\bar{\mathbf{x}})(y_j-\bar{\mathbf{y}})}{{\sqrt{\sum_{j=1}^m (x_j-\bar{\mathbf{x}})^2\sum_{j=1}^m (y_j-\bar{\mathbf{y}})^2}}}, \] where \(\bar{\mathbf{x}}\) is the mean of the vector \(\mathbf{x}\) defined by: \[\bar{\mathbf{x}}=\frac{1}{m}\sum_{j=1}^m x_j.\]

  • Note that the Pearson correlation coefficient satisfies the symmetry property A2 and is invariant to any positive linear transformation, i.e.: \[\rho(\alpha\mathbf{x},\mathbf{y})=\rho(\mathbf{x},\mathbf{y}),\] for any \(\alpha>0\).

  • The Pearson distance (or correlation distance) is defined by: \[ d(\mathbf{x},\mathbf{y})=1-\rho(\mathbf{x},\mathbf{y}). \]

  • Note that the Pearson distance does not satisfy A1, since \(d(\mathbf{x},\mathbf{y})=0\) whenever \(\mathbf{y}=\alpha\mathbf{x}+\beta\mathbf{1}\) with \(\alpha>0\), even though \(\mathbf{x}\neq\mathbf{y}\). Neither does it satisfy the triangle inequality. However, the symmetry property is fulfilled.

Cosine correlation distance

  • The cosine of the angle \(\theta\) between two vectors \(\mathbf{x}\) and \(\mathbf{y}\) is a measure of similarity given by: \[ \cos(\theta)=\frac{\mathbf{x}\cdot \mathbf{y}}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}=\frac{\sum_{j=1}^m x_j y_j}{{\sqrt{\sum_{j=1}^m x_j^2\sum_{j=1}^m y_j^2}}}. \]
  • Note that the cosine of the angle between the two centred vectors \(\mathbf{x}-\bar{\mathbf{x}}\mathbf{1}\) and \(\mathbf{y}-\bar{\mathbf{y}}\mathbf{1}\) coincides with the Pearson correlation coefficient of \(\mathbf{x}\) and \(\mathbf{y}\), where \(\mathbf{1}\) is a vector of units of \(\mathbb{R}^m\).
  • The cosine correlation distance is defined by: \[ d(\mathbf{x},\mathbf{y})=1-\cos(\theta). \]
  • It shares similar properties with the Pearson correlation distance. Likewise, axioms A1 and A3 are not satisfied.
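A quick numerical check that the cosine of the angle between the centred vectors coincides with the Pearson correlation (the two vectors reused from the examples in these notes):

```r
# The cosine of the angle between the centred vectors equals the
# Pearson correlation coefficient.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
x <- c(3, 1, 4, 15, 92)
y <- c(30, 2, 9, 20, 48)
cosine(x - mean(x), y - mean(y))  # equals cor(x, y)
cor(x, y)
```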

Spearman correlation distance

  • To calculate the Spearman rank-order correlation, we need to map separately each of the vectors to ranked data values: \[\mathbf{x}\rightarrow \text{rank}(\mathbf{x})=(x_1^r,\cdots,x_m^r).\]
  • Here, \(x_j^r\) is the rank of \(x_j\) among the set of values of \(\mathbf{x}\).
  • We illustrate this transformation with a simple example:
  • If \(\mathbf{x}=(3, 1, 4, 15, 92)\), then the rank-order vector is \(\text{rank}(\mathbf{x})=(2,1,3,4,5)\).
x=c(3, 1, 4, 15, 92)
rank(x)
## [1] 2 1 3 4 5
  • The Spearman rank correlation of two numerical vectors \(\mathbf{x}\) and \(\mathbf{y}\) is simply the Pearson correlation of the two corresponding rank-order vectors \(\text{rank}(\mathbf{x})\) and \(\text{rank}(\mathbf{y})\), i.e. \(\rho(\text{rank}(\mathbf{x}),\text{rank}(\mathbf{y}))\). This measure is useful because it is more robust against outliers than the Pearson correlation.
  • If all the \(m\) ranks are distinct, it can be computed using the following formula: \[ \rho(\text{rank}(\mathbf{x}),\text{rank}(\mathbf{y}))=1-\frac{6\sum_{j=1}^m d_j^2}{m(m^2-1)}, \] where \(d_j=x_j^r-y_j^r,\:j=1,\cdots,m\).
  • The Spearman distance is then defined by: \[ d(\mathbf{x},\mathbf{y})=1-\rho(\text{rank}(\mathbf{x}),\text{rank}(\mathbf{y})). \]
  • It can easily be shown that it is not a proper distance.
  • If all the \(m\) ranks are distinct, we get: \[ d(\mathbf{x},\mathbf{y})=\frac{6\sum_{j=1}^m d_j^2}{m(m^2-1)}. \]
x=c(3, 1, 4, 15, 92)
#rank(x)
y=c(30,2 , 9, 20, 48)
#rank(y)
d=rank(x)-rank(y)
#d
cor(rank(x),rank(y))
## [1] 0.7
# it is equal to 1-6*sum(d^2)/(5*(5^2-1))

Exercise 6

  • For the two vectors \(\mathbf{x}=(22,34,1,12,25,56,7)\) and \(\mathbf{y}=(2,64,12,2,22,5,8)\):
  • Calculate the ranks for each vector.
  • Deduce the Spearman correlation distance from those ranks.
  • Deduce the Spearman correlation distance from the alternative equation displayed above.
  • Calculate the Spearman correlation distance using the R function.
# Create the vectors
x = c(22,34,1,12,25,56,7)
y = c(2,64,12,2,22,5,8)

n = 7 # number of elements in each vector

# Compute the ranks of the vectors
rX = rank(x)
rY = rank(y)

# Display the ranks
rX
## [1] 4 6 1 3 5 7 2
rY
## [1] 1.5 7.0 5.0 1.5 6.0 3.0 4.0
# Compute the differences of ranks
d = rX - rY
d
## [1]  2.5 -1.0 -4.0  1.5 -1.0  4.0 -2.0
# Correlation from the shortcut formula (valid only when all ranks are distinct)
1-6*sum(d^2)/(n *(n^2-1))
## [1] 0.1696429
# Pearson correlation of the ranks (the values differ because y has tied
# ranks, so the shortcut formula does not apply exactly)
cor(rX, rY)
## [1] 0.1621687

Kendall tau distance

  • The Kendall rank correlation coefficient is calculated from the number of concordances between the rankings of \(\mathbf{x}\) and the rankings of \(\mathbf{y}\).
  • The number of pairs of observations among \(n\) observations or values is: \[{n \choose 2} =\frac{n(n-1)}{2}.\]
  • The pairs of observations \((x_{i},x_{j})\) and \((y_{i},y_{j})\) are said to be concordant if: \[\text{sign}(x_i-x_j)=\text{sign}(y_i-y_j),\] and to be discordant if: \[\text{sign}(x_i-x_j)=-\text{sign}(y_i-y_j),\] where \(\text{sign}(\cdot)\) returns \(1\) for positive numbers, \(-1\) for negative numbers, and \(0\) otherwise.
  • If \(x_i=x_j\) or \(y_i=y_j\) (or both), there is a tie.
  • The Kendall \(\tau\) coefficient is defined by (neglecting ties): \[\tau =\frac {1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n}\text{sign}(x_i-x_j)\,\text{sign}(y_i-y_j).\]
  • Let \(n_c\) (resp. \(n_d\)) be the number of concordant (resp. discordant) pairs; we have \[\tau =\frac {2(n_c-n_d)}{n(n-1)}.\]
  • The Kendall tau distance is then: \[d(\mathbf{x},\mathbf{y})=1-\tau. \]
  • Remark: the triangle inequality may fail in cases where there are ties.
x=c(3, 1, 4, 15, 92)
y=c(30,2 , 9, 20, 48)
tau=0
for (i in 1:5)
{  
tau=tau+sign(x -x[i])%*%sign(y -y[i])
}
tau=tau/(5*4)
tau
##      [,1]
## [1,]  0.6
cor(x,y, method="kendall")
## [1] 0.6

Exercise 7

  • For the two vectors \(\mathbf{x}=(22,34,1,12,25,56,7)\) and \(\mathbf{y}=(2,64,12,2,22,5,8)\):
  • List all pairs of coordinates.
  • How many pairs are there?
  • For each pair and each vector, compute the signs of the differences in coordinates.
  • Deduce the Kendall tau coefficient using the above computations.
  • Calculate the Kendall tau coefficient using the R function.

Kendall’s tau is a measure of dependency in a bivariate distribution. Loosely, two random variables are concordant if large values of one random variable are associated with large values of the other random variable. Similarly, two random variables are discordant if large values of one random variable are associated with small values of the other random variable.

x=c(22,34,1,12,25,56,7)
y=c(2,64,12,2,22,5,8)
# there are 49 pairs (7 x 7)
d1 <- expand.grid(x,y)
# show only the head of the data frame
knitr::kable(head(d1), format = 'html')
Var1 Var2
22 2
34 2
1 2
12 2
25 2
56 2
ggplot(d1,aes(x=Var1,y=Var2,label=row.names(d1)))+geom_point()+geom_text(label=row.names(d1),nudge_x = 0.25, nudge_y = 0.25, 
    check_overlap = T)

kendall.tau(x, y)  # A random sample is taken here
## [1] 0.0952381
res<-cor.test(x,y, method="kendall")
## Warning in cor.test.default(x, y, method = "kendall"): Cannot compute exact p-
## value with ties
res
## 
##  Kendall's rank correlation tau
## 
## data:  x and y
## z = 0.30382, p-value = 0.7613
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## 0.09759001

Standardization

  • Variables or measurements are often standardized before calculating dissimilarities.
  • Standardization converts the original variables into unitless variables.
  • A well known method is the z-score transformation.
  • Let \(\mathbf{v}\equiv (v_1,\cdots,v_n )\) be a vector of measurements recorded for \(n\) individuals or objects. \[ \mathbf{v}\rightarrow \left(\frac{v_1-\bar{\mathbf{v}}}{s_\mathbf{v}},\cdots,\frac{v_n-\bar{\mathbf{v}}}{s_\mathbf{v}}\right), \] where \(\bar{\mathbf{v}}\) and \(s_\mathbf{v}\) are the sample mean and standard deviation, respectively, given by: \[ \bar{\mathbf{v}}=\frac{1}{n}\sum_{i=1}^n v_i,\: s_\mathbf{v}=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(v_i-\bar{\mathbf{v}})^2}. \]
  • The transformed variable will have a mean of \(0\) and a variance of \(1\).
  • The results obtained with Pearson correlation measures and with standardized Euclidean distances are comparable.
  • For other methods, see: Milligan, G. W., & Cooper, M. C. (1988). A study of standardization of variables in cluster analysis. Journal of classification, 5(2), 181-204
v=c(3, 1, 4, 15, 92)
w=c(30,2 , 9, 20, 48)
(v-mean(v))/sd(v)
## [1] -0.5134116 -0.5647527 -0.4877410 -0.2053646  1.7712699
scale(v)
##            [,1]
## [1,] -0.5134116
## [2,] -0.5647527
## [3,] -0.4877410
## [4,] -0.2053646
## [5,]  1.7712699
## attr(,"scaled:center")
## [1] 23
## attr(,"scaled:scale")
## [1] 38.9551
(w-mean(w))/sd(w)
## [1]  0.45263128 -1.09293895 -0.70654639 -0.09935809  1.44621214
scale(w)
##             [,1]
## [1,]  0.45263128
## [2,] -1.09293895
## [3,] -0.70654639
## [4,] -0.09935809
## [5,]  1.44621214
## attr(,"scaled:center")
## [1] 21.8
## attr(,"scaled:scale")
## [1] 18.11629
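The link between standardization and the Pearson correlation can be made precise: for z-scored vectors of length \(n\), the squared Euclidean distance satisfies \(d^2=2(n-1)(1-\rho)\). A quick check on the vectors above:

```r
# For z-scored vectors of length n, the squared Euclidean distance and the
# Pearson correlation are linked by d^2 = 2(n - 1)(1 - r).
v <- c(3, 1, 4, 15, 92)
w <- c(30, 2, 9, 20, 48)
n <- length(v)
zv <- as.numeric(scale(v))
zw <- as.numeric(scale(w))
sum((zv - zw)^2)               # d^2
2 * (n - 1) * (1 - cor(v, w))  # same value
```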

Exercise 8

  • Consider the following example

From Kaufman & Rousseeuw, p. 6-8

  • Plot the data using a nice scatter plot.
  • Transform the Height from centimeters (cm) into feet (ft).
  • Display your data in a table.
  • Plot the data within a new scatter plot.
  • What do you observe?
  • Standardize the two variables Age and Height.
  • Display your data in a table.
  • Plot the standardized data within a new scatter plot.
  • Conclude.
df=data.frame(Person=c('llan','Jacqueline','Kim','Lieve','Leon','Peter','Talia','Tina'), Sex=c(1,0,0,0,1,1,0,0), Married=c(0,1,0,1,1,1,0,0),    Fair=c(1,0,1,0,0,0,0,0), Eyes=c(1,0,0,0,0,0,1,1),   Wears=c(0,1,0,0,1,1,0,0),   Roundface=c(0,0,0,0,1,0,1,1), Pessimist=c(1,0,1,0,0,1,0,0), Evining_type=c(0,0,0,1,1,1,0,0),    only_child=c(0,0,0,1,1,0,0,0),  Left_hended=c(0,0,1,0,0,0,0,0))

Similarity measures for binary data

  • A common simple situation occurs when all information is of the presence/absence of 2-level qualitative characters.

  • We assume there are \(m\) characters.

  • The presence of the character is coded by \(1\) and the absence by \(0\).

  • We have at our disposal two vectors.

  • \(\mathbf{x}\) is observed for a first individual (or object).

  • \(\mathbf{y}\) is observed for a second individual.

  • We can then calculate the following four statistics:

    \(a=\mathbf{x\cdot y}=\sum_{j=1}^mx_jy_j.\)

    \(b=\mathbf{x\cdot (1-y)}=\sum_{j=1}^mx_j(1-y_j).\)

    \(c=\mathbf{(1-x)\cdot y}=\sum_{j=1}^m(1-x_j)y_j.\)

    \(d=\mathbf{(1-x)\cdot (1-y)}=\sum_{j=1}^m(1-x_j)(1-y_j).\)

  • The counts of matches are \(a\) for \((1,1)\) and \(d\) for \((0,0)\);

  • The counts of mismatches are \(b\) for \((1,0)\) and \(c\) for \((0,1)\).

  • Note that obviously: \(a+b+c+d= m\).

  • This gives a very useful \(2 \times 2\) association table.

                          Second individual
                      1          0          Totals
First individual  1   \(a\)      \(b\)      \(a+b\)
                  0   \(c\)      \(d\)      \(c+d\)
Totals                \(a+c\)    \(b+d\)    \(m\)
  • The data shows \(8\) people (individuals) and \(10\) binary variables:
  • Sex, Married, Fair Hair, Blue Eyes, Wears Glasses, Round Face, Pessimist, Evening Type, Is an Only Child, Left-Handed.
data=c(
1,0,1,1,0,0,1,0,0,0,
0,1,0,0,1,0,0,0,0,0,
0,0,1,0,0,0,1,0,0,1,
0,1,0,0,0,0,0,1,1,0,
1,1,0,0,1,1,0,1,1,0,
1,1,0,0,1,0,1,1,0,0,
0,0,0,1,0,1,0,0,0,0,
0,0,0,1,0,1,0,0,0,0
)
data=data.frame(matrix(data, nrow=8,byrow=T))
row.names(data)=c("Ilan","Jacqueline","Kim","Lieve","Leon","Peter","Talia","Tina")
names(data)=c("Sex", "Married", "Fair Hair", "Blue Eyes", "Wears Glasses", "Round Face", "Pessimist", "Evening Type", "Is an Only Child", "Left-Handed")
  • We are comparing the records for Ilan with Talia.
set.seed(893)
datat<-as.data.frame(t(data))
datat=lapply(datat,as.factor)
Ilan=datat$Ilan
Talia =datat$Talia
print(ctable(Ilan,Talia,prop = 'n',style = "rmarkdown"))

Cross-Tabulation

Ilan * Talia

           Talia
Ilan       0    1    Total
    0      5    1    6
    1      3    1    4
    Total  8    2    10
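The four counts can also be recovered directly from the two binary records (a minimal sketch; the two rows of the data are written out explicitly):

```r
# Recovering a, b, c, d directly from the two binary records.
x = c(1,0,1,1,0,0,1,0,0,0)  # Ilan
y = c(0,0,0,1,0,1,0,0,0,0)  # Talia
a = sum(x * y)              # (1,1) matches
b = sum(x * (1 - y))        # (1,0) mismatches
c = sum((1 - x) * y)        # (0,1) mismatches
d = sum((1 - x) * (1 - y))  # (0,0) matches
c(a = a, b = b, c = c, d = d)
## a b c d
## 1 3 1 5
```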
  • Therefore: \(a = 1,\:b = 3,\: c = 1,\: d = 5\).
  • Note that interchanging Ilan and Talia would permute \(b\) and \(c\) while leaving \(a\) and \(d\) unchanged.
  • A good similarity or dissimilarity coefficient must treat \(b\) and \(c\) symmetrically.
  • A similarity measure is denoted by: \(s(\mathbf{x},\mathbf{y})\).
  • The corresponding distance is then defined as: \[d(\mathbf{x},\mathbf{y})=1-s(\mathbf{x},\mathbf{y}).\]
  • Alternatively, we have: \[d(\mathbf{x},\mathbf{y})=\sqrt{1-s(\mathbf{x},\mathbf{y})}.\]
  • A list of some of the similarity measures \(s(\mathbf{x},\mathbf{y})\) that have been suggested for binary data is shown below.
  • A more complete list can be found in: Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1), 5-48.
Coefficient \(s(\mathbf{x},\mathbf{y})\) \(d(\mathbf{x},\mathbf{y})=1-s(\mathbf{x},\mathbf{y})\)
Simple matching \(\frac{a+d}{a+b+c+d}\) \(\frac{b+c}{a+b+c+d}\)
Jaccard \(\frac{a}{a+b+c}\) \(\frac{b+c}{a+b+c}\)
Rogers and Tanimoto (1960) \(\frac{a+d}{a+2(b+c)+d}\) \(\frac{2(b+c)}{a+2(b+c)+d}\)
Gower and Legendre (1986) \(\frac{2(a+d)}{2(a+d)+b+c}\) \(\frac{b+c}{2(a+d)+b+c}\)
Gower and Legendre (1986) \(\frac{2a}{2a+b+c}\) \(\frac{b+c}{2a+b+c}\)
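As a sanity check, the coefficients of the table can be computed with a small helper function (ours, not from any package):

```r
# The five similarity coefficients of the table, computed from the counts
# a, b, c, d; with a = 1, b = 3, c = 1, d = 5 the corresponding distances
# 1 - s match the values obtained below with dist.binary().
binary_similarities <- function(a, b, c, d) {
  c(simple_matching  = (a + d) / (a + b + c + d),
    jaccard          = a / (a + b + c),
    rogers_tanimoto  = (a + d) / (a + 2 * (b + c) + d),
    gower_legendre_1 = 2 * (a + d) / (2 * (a + d) + b + c),
    gower_legendre_2 = 2 * a / (2 * a + b + c))
}
round(binary_similarities(a = 1, b = 3, c = 1, d = 5), 4)
```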
  • To calculate these coefficients, we use the function dist.binary() available in the ade4 package.

  • All the distances in the ade4 package are of type \(d(\mathbf{x},\mathbf{y})= \sqrt{1 - s(\mathbf{x},\mathbf{y})}\).

library(ade4)
a=1
b=3
c=1
d=5
dist.binary(data[c("Ilan","Talia"),],method=2)^2
##       Ilan
## Talia  0.4
1-(a+d)/(a+b+c+d)
## [1] 0.4
dist.binary(data[c("Ilan","Talia"),],method=1)^2
##       Ilan
## Talia  0.8
1-a/(a+b+c)
## [1] 0.8
dist.binary(data[c("Ilan","Talia"),],method=4)^2
##            Ilan
## Talia 0.5714286
1-(a+d)/(a+2*(b+c)+d)
## [1] 0.5714286
# The first Gower and Legendre coefficient is missing from dist.binary()
dist.binary(data[c("Ilan","Talia"),],method=5)^2
##            Ilan
## Talia 0.6666667
1-2*a/(2*a+b+c)
## [1] 0.6666667

  • The reason for such a large number of possible measures has to do with the apparent uncertainty as to how to deal with the count of zero-zero matches \(d\).
  • The measures embedding \(d\) are sometimes called symmetrical.
  • The other measures are called asymmetrical.
  • In some cases, of course, zero-zero matches are completely equivalent to one-one matches, and therefore should be included in the calculated similarity measure.
  • An example is gender, where there is no preference as to which of the two categories should be coded zero or one.
  • But in other cases the inclusion or otherwise of \(d\) is more problematic; for example, when the zero category corresponds to the genuine absence of some property, such as wings in a study of insects.

Exercise 9

From Kaufman and Rousseeuw, pp. 296-297.

Exercise 10

Nominal variables

  • We previously studied above binary variables which can only take on two states coded \(0,1\).
  • We generalize this approach to nominal variables which may take on more than two states.
  • Eye’s color may have for example four states: blue, brown, green, grey.
  • Let \(M\) be the number of states and code the outcomes as \(1, \cdots, M\).
  • We may choose \(1 =\text{blue},\) \(2 =\text{brown},\) \(3 =\text{green},\) and \(4 =\text{grey}\).
  • These states are not ordered in any way.
  • One strategy would be creating a new binary variable for each of the \(M\) nominal states.
  • Then to put it equal to \(1\) if the corresponding state occurs and to \(0\) otherwise.
  • After that, one could resort to one of the dissimilarity coefficients of the previous subsection.
  • The most common way of measuring the similarity or dissimilarity between two objects through categorical variables is the simple matching approach.
  • Let \(\mathbf{x},\mathbf{y}\) be the \(m\)-dimensional nominal records of two individuals.
  • Let define the function: \[\delta(x_j,y_j)\equiv \begin{cases}0, \text{ if } x_j=y_j;\\1,\text{ if } x_j \neq y_j.\end{cases}\]
  • Let \(N_{a+d}\) be the number of attributes of the two individuals on which the two records match: \[N_{a+d}=\sum_{j=1}^m[1-\delta(x_j,y_j)] .\]
  • Let \(N_{b+c}\) be the number of attributes on which the two records do not match: \[N_{b+c}= \sum_{j=1}^m\delta(x_j,y_j).\]
  • Let \(N_d\) be the number of attributes on which the two records match in a “not applicable” category.
  • The distance corresponding to the simple matching approach is: \[ d(\mathbf{x},\mathbf{y})=\frac{\sum_{j=1}^m\delta(x_j,y_j)}{m}=\frac{N_{b+c}}{N_{a+d}+N_{b+c}}. \]
  • Note that simple matching has exactly the same meaning as in the preceding section.
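A minimal sketch of simple matching in R (the eye-colour records below are invented for illustration):

```r
# Simple matching distance for nominal records: the proportion of
# attributes on which the two records disagree.
simple_matching_dist <- function(x, y) mean(x != y)
r1 <- c("blue", "brown", "green", "grey", "blue")
r2 <- c("blue", "green", "green", "grey", "brown")
simple_matching_dist(r1, r2)  # 2 mismatches out of 5
## [1] 0.4
```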

From: GAN et al. (2007)

Gower’s dissimilarity

  • Gower’s coefficient is a dissimilarity measure specifically designed for handling mixed attribute types or variables.

  • See: GOWER, John C. A general coefficient of similarity and some of its properties. Biometrics, 1971, p. 857-871.

  • The coefficient is calculated as the weighted average of attribute contributions.

  • Weights are usually used only to indicate which attribute values can actually be compared meaningfully.

  • The formula is: \[ d(\mathbf{x},\mathbf{y})=\frac{\sum_{j=1}^m w_j \delta(x_j,y_j)}{\sum_{j=1}^m w_j}. \]

  • The weight \(w_j\) is set equal to \(1\) when both measurements \(x_j\) and \(y_j\) are non-missing, and to \(0\) otherwise.

  • The number \(\delta(x_j,y_j)\) is the contribution of the \(j\)th measure or variable to the dissimilarity measure.

  • If the \(j\)th measure is nominal, we take
    \[ \delta(x_j,y_j)\equiv \begin{cases}0, \text{ if } x_j=y_j;\\1,\text{ if } x_j \neq y_j.\end{cases} \]

  • If the \(j\)th measure is interval-scaled, we take instead: \[ \delta(x_j,y_j)\equiv \frac{|x_j-y_j|}{R_j}, \] where \(R_j\) is the range of variable \(j\) over the available data.

  • Consider the following data set:

From: Struyf, A., Hubert, M., & Rousseeuw, P. (1997). Clustering in an object-oriented environment. Journal of Statistical Software, 1(4), 1-30.

  • The dataset contains 18 flowers and 8 characteristics:
  1. Winters: binary, indicates whether the plant may be left in the garden when it freezes.
  2. Shadow: binary, shows whether the plant needs to stand in the shadow.
  3. Tubers (Tubercule): asymmetric binary, distinguishes between plants with tubers and plants that grow in any other way.
  4. Color: nominal, specifies the flower’s color (1=white, 2=yellow, 3= pink, 4=red, 5= blue).
  5. Soil: ordinal, indicates whether the plant grows in dry (1), normal (2), or wet (3) soil.
  6. Preference: ordinal, someone’s preference ranking, going from 1 to 18.
  7. Height: interval scaled, the plant’s height in centimeters.
  8. Distance: interval scaled, the distance in centimeters that should be left between the plants.
  • The dissimilarity between Begonia and Broom (Genêt) can be calculated as follows:


library(cluster)
library(dplyr)
data <-flower %>% 
rename(Winters=V1,Shadow=V2,Tubers=V3,Color=V4,Soil=V5,Preference=V6,Height=V7,Distance=V8) %>%
mutate(Winters=recode(Winters,"1"="Yes","0"="No"),
      Shadow=recode(Shadow,"1"="Yes","0"="No"),
      Tubers=recode(Tubers,"1"="Yes","0"="No"),
      Color=recode(Color,"1"="white", "2"="yellow", "3"= "pink", "4"="red", "5"="blue"),
      Soil=recode(Soil,"1"="dry", "2"="normal", "3"= "wet")
      ) 
row.names(data)=c("Begonia","Broom","Camellia","Dahlia","Forget-me-not","Fuchsia",
 "Geranium", "Gladiolus","Heather","Hydrangea","Iris","Lily","Lily-of-the-valley",
 "Peony","Pink Carnation","Red Rose","Scotch Rose","Tulip")
#flower
#row.names(flower)
#daisy(flower, metric=c("gower"), stand=FALSE, type=list(), warnBin=warnType, warnAsym=warnType,warnConst=warnType, warnType=TRUE)
#matr= daisy(flower, metric=c("gower"))
#matr
#as.matrix(matr)[1:2,1:2]
#data.frame(as.matrix(matr))
#knitr::kable(data.frame(as.matrix(matr)),format="simple", caption="Tableau")
#fviz_dist(dist.eucl)
res=lapply(data,class)  
res=as.data.frame(res)
res[1,] %>% 
knitr::kable()
Winters Shadow Tubers Color Soil Preference Height Distance
factor factor factor factor ordered ordered numeric numeric
flower[1:2,]
##   V1 V2 V3 V4 V5 V6  V7 V8
## 1  0  1  1  4  3 15  25 15
## 2  1  0  0  2  1  3 150 50
max(data$Height)-min(data$Height)
## [1] 180
max(data$Distance)-min(data$Distance)
## [1] 50

\[ \frac{|1-0|+|0-1|+|0-1|+1+|1-3|/2+|3-15|/17+|150-25|/180+|50-15|/50}{8}\approx 0.8875408 \]

Daisy function

The daisy() function is described in the cluster package documentation, available on CRAN.

library(cluster)
(abs(1-0)+abs(0-1)+abs(0-1)+1+abs(1-3)/2+abs(3-15)/17+abs(150-25)/180+abs(50-15)/50)/8
## [1] 0.8875408
dist<-daisy(data[,1:8],metric = "gower")
as.matrix(dist)[1:2,1:2]
##           Begonia     Broom
## Begonia 0.0000000 0.8875408
## Broom   0.8875408 0.0000000
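The daisy() result can be checked against the definition directly. As a language-neutral illustration (the course code is R), here is a minimal Python sketch of the one-pair Gower dissimilarity, with the variable types and ranges (2, 17, 180 and 50) taken from the manual computation above:

```python
def gower_pair(x, y, kinds, ranges):
    """One-pair Gower dissimilarity: binary/nominal variables count a
    0/1 mismatch, ordinal (rank-coded) and numeric ones |x - y| / range."""
    total = 0.0
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if kind in ("binary", "nominal"):
            total += 0.0 if xi == yi else 1.0
        else:
            total += abs(xi - yi) / rng
    return total / len(x)

# Begonia vs Broom, variables ordered Winters, Shadow, Tubers, Color,
# Soil, Preference, Height, Distance (raw codes from flower[1:2,])
begonia = (0, 1, 1, 4, 3, 15, 25, 15)
broom   = (1, 0, 0, 2, 1, 3, 150, 50)
kinds = ("binary", "binary", "binary", "nominal",
         "ordinal", "ordinal", "numeric", "numeric")
ranges = (None, None, None, None, 2, 17, 180, 50)
print(round(gower_pair(begonia, broom, kinds, ranges), 7))  # 0.8875408
```

The three binaries and the nominal Color each contribute a full mismatch here, which is why this pair is so dissimilar.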

More on distance matrix computation

  • We use a subset of the data by taking 15 random rows among the 50 rows in the data set.
  • We are using the function sample().
  • We standardize the data using the function scale().
knitr::kable(USArrests,caption = "USArrest data set",
digits = 1,align=c("c","c","c","c"),booktabs = TRUE)%>%
kable_styling(latex_options = "HOLD_position")
USArrest data set
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Connecticut 3.3 110 77 11.1
Delaware 5.9 238 72 15.8
Florida 15.4 335 80 31.9
Georgia 17.4 211 60 25.8
Hawaii 5.3 46 83 20.2
Idaho 2.6 120 54 14.2
Illinois 10.4 249 83 24.0
Indiana 7.2 113 65 21.0
Iowa 2.2 56 57 11.3
Kansas 6.0 115 66 18.0
Kentucky 9.7 109 52 16.3
Louisiana 15.4 249 66 22.2
Maine 2.1 83 51 7.8
Maryland 11.3 300 67 27.8
Massachusetts 4.4 149 85 16.3
Michigan 12.1 255 74 35.1
Minnesota 2.7 72 66 14.9
Mississippi 16.1 259 44 17.1
Missouri 9.0 178 70 28.2
Montana 6.0 109 53 16.4
Nebraska 4.3 102 62 16.5
Nevada 12.2 252 81 46.0
New Hampshire 2.1 57 56 9.5
New Jersey 7.4 159 89 18.8
New Mexico 11.4 285 70 32.1
New York 11.1 254 86 26.1
North Carolina 13.0 337 45 16.1
North Dakota 0.8 45 44 7.3
Ohio 7.3 120 75 21.4
Oklahoma 6.6 151 68 20.0
Oregon 4.9 159 67 29.3
Pennsylvania 6.3 106 72 14.9
Rhode Island 3.4 174 87 8.3
South Carolina 14.4 279 48 22.5
South Dakota 3.8 86 45 12.8
Tennessee 13.2 188 59 26.9
Texas 12.7 201 80 25.5
Utah 3.2 120 80 22.9
Vermont 2.2 48 32 11.2
Virginia 8.5 156 63 20.7
Washington 4.0 145 73 26.2
West Virginia 5.7 81 39 9.3
Wisconsin 2.6 53 66 10.8
Wyoming 6.8 161 60 15.6
set.seed(123)
ss <- sample(1:50,15) 
df <- USArrests[ss, ] 
df.scaled <- scale(df)
knitr::kable(df.scaled,caption = "Sample of USArrest data set",digits = 4,align=c("c","c","c"),booktabs = TRUE)%>%
kable_classic(latex_options = "hold_position")
Sample of USArrest data set
Murder Assault UrbanPop Rape
New Mexico 0.5851 1.0230 0.2251 0.6110
Iowa -1.7022 -1.5476 -0.6892 -1.4389
Indiana -0.4591 -0.9078 -0.1266 -0.4829
Arizona -0.2354 1.1240 0.9284 0.5026
Tennessee 1.0326 -0.0659 -0.5486 0.0986
Texas 0.9083 0.0801 0.9284 -0.0394
Oregon -1.0309 -0.3914 0.0141 0.3351
West Virginia -0.8320 -1.2670 -1.9552 -1.6360
Missouri -0.0116 -0.1781 0.2251 0.2267
Montana -0.7575 -0.9527 -0.9706 -0.9362
Nebraska -1.1801 -1.0312 -0.3376 -0.9264
California -0.0116 0.9220 1.7020 1.4487
South Carolina 1.3309 0.9557 -1.3222 -0.3351
Nevada 0.7840 0.6526 0.9987 1.9809
Florida 1.5796 1.5843 0.9284 0.5913
  • The main R functions for computing distances:
  1. dist() function accepts only numeric data.
  2. get_dist() function [factoextra package] accepts only numeric data. It also supports correlation-based distance measures.
  3. daisy() function [cluster package] is able to handle other variable types (nominal, ordinal, …).
  • Remark: All these functions compute distances between rows of the data.

  • Remark: If we want to compute pairwise distances between variables, we must transpose the data to have variables in the rows.

  • We first compute Euclidean distances.

dist.eucl <- dist(df.scaled, method = "euclidean",upper = TRUE)


#stargazer(as.data.frame(as.matrix(dist.eucl)[1:3, 1:3]),header=TRUE, type='html',summary=FALSE,digits=1)


round(sqrt(sum((df.scaled["New Mexico",]-df.scaled["Iowa",])^2)),1)

[1] 4.1

round(sqrt(sum((df.scaled["New Mexico",]-df.scaled["Indiana",])^2)),1)

[1] 2.5

round(sqrt(sum((df.scaled["Iowa",]-df.scaled["Indiana",])^2)),1)

[1] 1.8

  • We also compute correlation-based distances.
library("factoextra")
dist.cor <- get_dist(df.scaled, method = "pearson")
round(as.matrix(dist.cor)[1:3, 1:3], 1)
##            New Mexico Iowa Indiana
## New Mexico        0.0  1.7     2.0
## Iowa              1.7  0.0     0.3
## Indiana           2.0  0.3     0.0
round(1-cor(df.scaled["New Mexico",],df.scaled["Iowa",]),1)
## [1] 1.7
round(1-cor(df.scaled["New Mexico",],df.scaled["Indiana",]),1)
## [1] 2
round(1-cor(df.scaled["Iowa",],df.scaled["Indiana",]),1)
## [1] 0.3
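The get_dist() values can be reproduced from the standardized rows shown above. A small Python sketch (using the table's 4-decimal values, so the results agree only at the displayed precision) of the Pearson distance \(1-r\):

```python
def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# standardized rows of the sample (Murder, Assault, UrbanPop, Rape)
new_mexico = (0.5851, 1.0230, 0.2251, 0.6110)
iowa = (-1.7022, -1.5476, -0.6892, -1.4389)
indiana = (-0.4591, -0.9078, -0.1266, -0.4829)
print(round(1 - pearson(new_mexico, iowa), 1))  # 1.7, as in dist.cor
print(round(1 - pearson(iowa, indiana), 1))     # 0.3
```

Note that the correlation distance is computed across the four variables of each pair of rows, not across observations.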

Visualizing distance matrices

  • A simple solution for visualizing the distance matrices is to use the function fviz_dist() [factoextra package].
  • Other specialized methods will be described later on.
library(factoextra)
fviz_dist(dist.eucl)

Partitioning Clustering

  • Partitioning clustering methods classify the observations of a data set into multiple groups based on their similarity.
  • The algorithms require the analyst to specify the number of clusters to be generated.
  • We describe the commonly used partitioning clustering methods, including:
  1. K-means clustering (MacQueen, 1967), in which each cluster is represented by the center (mean) of the data points belonging to the cluster. The K-means method is sensitive to anomalous data points and outliers.
  2. K-medoids clustering or PAM (Partitioning Around Medoids, Kaufman & Rousseeuw, 1990), in which each cluster is represented by one of the objects in the cluster. PAM is less sensitive to outliers than k-means.
  3. CLARA algorithm (Clustering Large Applications), which is an extension to PAM adapted for large data sets.

K-Means Clustering

  • The description of the algorithm is based on:
  • HARTIGAN, John A. Clustering algorithms. John Wiley & Sons, Inc., 1975.
  • The data used by the author are provided below.

  • The principal nutrients in meat, fish, and fowl are listed.
  • Recall that 1 oz = 28.34952 g.
  • Estimated daily dietary allowances are: food energy (3200 cal), protein (70 g), calcium (0.8 g), and iron (10 mg).
  • Table 4.2 converts the variables (with the exception of Fat) into percentages of the recommended daily allowances.

Data

  • For example, the first line (BB) is obtained in the following way:
  • \(340/3200=11\%\text{(Food Energy)}\).
  • \(20/70=29\%\text{(Protein)}\).
  • \({0.009}/{0.8}=1\%\text{ (Calcium)}\).
  • \({2.6}/{10}= 26\%\text{ (Iron)}\).
  • An argument could be made that iron is less important than calories or protein and so should be given less weight or ignored entirely.
  • There are \(n\) objects and \(k\) clusters, \(k\leq n\).
  • Our purpose is to partition the \(n\) objects (here foods) so that objects within clusters are close and objects in different clusters are distant.
  • Each cluster contains at least one object and each object belongs to only one cluster.
  • There is a very large number of possible partitions.

Exercice 11

  • What is the number of possible partitions?
  • Hint: use the Stirling numbers of the second kind (see Wikipedia).
  • Example of functions in R.
library(multicool)
## Warning: package 'multicool' was built under R version 4.1.3
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 4.1.2
Stirling2(8,3)
## [1] 966
Stirling2(8,1)
## [1] 1
Stirling2(3,2)
## [1] 3
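These values follow from the standard recurrence \(S(n,k)=k\,S(n-1,k)+S(n-1,k-1)\). A short Python sketch of it, as a language-neutral check of the R packages' results:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of partitions of an n-set into k non-empty clusters,
    via the recurrence S(n,k) = k*S(n-1,k) + S(n-1,k-1)."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(8, 3))  # 966, as with multicool::Stirling2
print(stirling2(5, 3))  # 25 partitions of the five beef foods into 3 clusters
```

The recurrence reads: element \(n\) either joins one of the \(k\) existing clusters, or forms a new cluster on its own.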

K-Means

  • The discordance between the data and a given partition is denoted by \(e\).
  • We use the technique of local optimization.
  • A neighborhood of partitions is defined for each partition.
  • Starting from an initial partition, we search through a set of partitions at each step.
  • We move from the current partition to the partition in its neighborhood for which \(e\) is minimal.
  • If the neighborhoods are very large, it is cheaper computationally to move to the first partition discovered in the neighborhood where \(e\) is reduced from its present value.
  • A number of stopping rules are possible.
  • For example, the search stops when \(e\) is not reduced by movement to the neighborhood.
  • The present partition is locally optimal in that it is the best partition in its neighborhood.
  • Consider partitions of the five (\(n=5\)) beef foods \(\{\text{BB, BR,BS,BC,BH}\}\) into three clusters (\(k=3\)).
  • In total, there are 25 such partitions.
library(gmp)
Stirling2(5,3)
## Big Integer ('bigz') :
## [1] 25
  • A plausible neighborhood for a partition is the set of partitions obtained by transferring an object from one cluster to another.

  • For the partition (BB BR) (BS) (BC BH), the neighborhood consists of the following ten partitions:

  1. (BR) (BB BS) (BC BH)
  2. (BR) (BS) (BB BC BH)
  3. (BB) (BR BS) (BC BH)
  4. (BB) (BS) (BR BC BH)
  5. (BB BR BS) \(\emptyset\) (BC BH)
  6. (BB BR) \(\emptyset\) (BS BC BH)
  7. (BB BR BC) (BS) (BH)
  8. (BB BR) (BS BC) (BH)
  9. (BB BR BH) (BS) (BC)
  10. (BB BR) (BS BH) (BC)

K-Means Algorithm

  • Let \(\mathbf{x}_i\equiv (x_i^1,\cdots,x_i^m)\) be the vector of values for object \(i\), \(i=1,\cdots ,n.\)

  • The variables are assumed scaled.

  • The partition has \(k\) disjoint clusters \(C_1,\cdots,C_k\), where \(C_l\) denotes the set of indices of the objects in cluster \(l\).

  • Let \(n_l\) be the number of objects in cluster \(C_l\).

  • Each of the \(n\) objects lies in just one of the \(k\) clusters.

  • Note that \(\sum_{l=1}^k n_l=n\).

  • The vector of means over the objects in cluster \(C_l\) is denoted by \(\bar{\mathbf{x}}_{l}\), with \[ \bar{\mathbf{x}}_{l}\equiv\frac{1}{n_l}\sum_{i\in C_l}\mathbf{x}_{i}=(\bar{x}_l^1,\cdots,\bar{x}_l^m),\:l=1,\cdots, k, \] where \[ \bar{x}_l^j\equiv \frac{\sum_{i\in C_l}x_i^{j}}{n_l},\:j=1,\cdots, m; \:l=1,\cdots,k. \]

  • The distance between object \(i\) and cluster \(l\) is \(d(\mathbf{x}_i,\bar{\mathbf{x}}_l)\), where \(d\) is taken to be the Euclidean distance \[ d(\mathbf{x}_i,\bar{\mathbf{x}}_l)=||\mathbf{x}_i-\bar{\mathbf{x}}_l||=\bigg[\sum_{j=1}^m(x_i^j-\bar{x}_l^j)^2\bigg]^{1/2},\:i=1,\cdots,n;\:l=1,\cdots,k, \] where \(||\cdot||\) is the Euclidean norm.

  • The error of the partition is taken to be

\[ e= \sum_{l=1}^{k}\sum_{i\in C_l} ||\mathbf{x}_i-\bar{\mathbf{x}}_l||^2. \]

  • Alternatively, we have \[ e=\sum_{i=1}^{n}||\mathbf{x}_i-\bar{\mathbf{x}}_{l(i)}||^2, \]

where \(l(i)\) is the index of the cluster of object \(i\).

  • The general procedure is to search for a partition with a small error \(e\) by moving cases from one cluster to another.
  • The search ends when no such movement reduces \(e\).
  • STEP 1. Assume initial clusters. Compute the cluster means \(\bar{\mathbf{x}}_l\) and the initial error \(e\).
  • STEP 2. For the first object, compute for every cluster \(l\)

\[ \Delta e= \frac{n_ld^2(\mathbf{x}_1,\bar{\mathbf{x}}_l)}{n_l+1}-\frac{n_{l(1)}d^2(\mathbf{x}_1,\bar{\mathbf{x}}_{l(1)})}{n_{l(1)}-1},\:l=1,\cdots, k,\:l\neq l(1). \]

  • It corresponds to the error variation when transferring the first object from cluster \(l(1)\), to which it belongs, to cluster \(l\).
  • If the minimum of this quantity over all \(l\neq l(1)\) is negative, transfer the first case from cluster \(l(1)\) to this minimal \(l\).
  • Adjust the cluster means of \(l(1)\) and the minimal \(l\) and add the error variation (which is negative) to \(e\).
  • STEP 3. Repeat STEP 2 for each object \(i\) such that \(2\leq i \leq n\).
  • STEP 4. If no movement of an object from one cluster to another occurs for any case, stop. Otherwise, return to STEP 2.
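The four steps can be sketched in a few lines. The following Python sketch is an illustration with made-up 2-D data, not Hartigan's original code: it applies the \(\Delta e\) transfer test of STEP 2 to each object in turn and repeats until no transfer reduces the error:

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def error(points, assign, k):
    # total within-cluster sum of squared distances to the cluster means
    e = 0.0
    for l in range(k):
        pts = [p for p, a in zip(points, assign) if a == l]
        m = tuple(sum(c) / len(pts) for c in zip(*pts))
        e += sum(sq_dist(p, m) for p in pts)
    return e

def hartigan_step(points, assign, k):
    """One pass of STEPs 2-3: transfer an object whenever the Delta-e
    of the formula above is negative. Returns True if anything moved."""
    moved = False
    for i, p in enumerate(points):
        src = assign[i]
        sizes = [assign.count(l) for l in range(k)]
        if sizes[src] == 1:
            continue                    # never empty a cluster
        means = []
        for l in range(k):
            pts = [q for q, a in zip(points, assign) if a == l]
            means.append(tuple(sum(c) / len(pts) for c in zip(*pts)))
        removal = sizes[src] * sq_dist(p, means[src]) / (sizes[src] - 1)
        cost, dest = min((sizes[l] * sq_dist(p, means[l]) / (sizes[l] + 1), l)
                         for l in range(k) if l != src)
        if cost - removal < 0:          # Delta e < 0: the transfer pays off
            assign[i] = dest
            moved = True
    return moved

# made-up toy data: two obvious groups, deliberately shuffled start
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
assign = [0, 1, 0, 1, 0, 1]
while hartigan_step(points, assign, 2):
    pass                                # STEP 4: stop when nothing moves
print(assign)  # [0, 0, 0, 1, 1, 1]
```

Since every accepted transfer strictly decreases \(e\) and there are finitely many partitions, the loop always terminates at a locally optimal partition.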

Exercice 12

Prove that the error variation is indeed given by:

\[ \Delta e= \frac{n_ld^2(\mathbf{x}_1,\bar{\mathbf{x}}_l)}{n_l+1}-\frac{n_{l(1)}d^2(\mathbf{x}_1,\bar{\mathbf{x}}_{l(1)})}{n_{l(1)}-1},\:l=1,\cdots, k,\:l\neq l(1). \]
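A full proof aside, the formula can at least be checked numerically: transfer one object, recompute \(e\) from scratch, and compare the difference with the formula. A Python sketch with made-up 2-D data:

```python
def sqd(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def e(clusters):
    # total within-cluster sum of squared distances to the cluster means
    return sum(sqd(p, mean(c)) for c in clusters for p in c)

x1 = (0.0, 0.0)
A = [x1, (2.0, 0.0), (0.0, 2.0)]   # cluster l(1), the one containing x1
B = [(5.0, 5.0), (6.0, 5.0)]       # candidate destination cluster l
delta_formula = (len(B) * sqd(x1, mean(B)) / (len(B) + 1)
                 - len(A) * sqd(x1, mean(A)) / (len(A) - 1))
delta_direct = e([A[1:], B + [x1]]) - e([A, B])
print(round(delta_formula, 6), round(delta_direct, 6))  # 35.5 35.5
```

The two values coincide: the \(n_l/(n_l+1)\) and \(n_{l(1)}/(n_{l(1)}-1)\) factors account for the shift of the cluster means when the object is inserted or removed.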

K-means applied to nutrition data

  • Only the first eight foods will be considered.
  • Only three variables, food energy, protein, and calcium as a percentage of recommended daily allowances are used.
  • The eight foods \((n=8)\) are partitioned into three clusters (\(k=3\)).

Exercice 13

  • Explain in detail the k-means algorithm based on the following pages of Hartigan (1975).

#library("cluster.datasets")
#write.csv(rda.meat.fish.fowl.1959,"Hartigandata1.csv")
df<-read.csv("Hartigandata1.csv")
print(df)
##     X                name energy protein fat calcium iron
## 1   1        Braised beef     11      29  28       1   26
## 2   2           Hamburger      8      30  17       1   27
## 3   3          Roast beef     18      21  39       1   20
## 4   4           Beefsteak     12      27  32       1   26
## 5   5         Canned beef      6      31  10       2   37
## 6   6     Broiled chicken      8      29   3       1   14
## 7   7      Canned chicken      5      36   7       2   15
## 8   8          Beef heart      5      37   5       2   59
## 9   9      Roast lamb leg      8      29  20       1   26
## 10 10 Roast lamb shoulder      9      26  25       1   23
## 11 11          Smoked ham     11      29  28       1   25
## 12 12          Pork roast     11      27  29       1   25
## 13 13       Pork simmered     11      27  30       1   24
## 14 14         Beef tongue      6      26  14       1   25
## 15 15         Veal cutlet      6      33   9       1   27
## 16 16      Baked bluefish      4      31   4       3    6
## 17 17           Raw clams      2      16   1      10   60
## 18 18        Canned clams      1      10   1       9   54
## 19 19     Canned crabmeat      3      20   2       5    8
## 20 20       Fried haddock      4      23   5       2    5
## 21 21    Broiled mackerel      6      27  13       1   10
## 22 22     Canned mackerel      5      23   9      20   18
## 23 23         Fried perch      6      23  11       2   13
## 24 24       Canned salmon      4      24   5      20    7
## 25 25     Canned sardines      6      31   9      46   25
## 26 26         Canned tuna      5      36   7       1   12
## 27 27       Canned shrimp      3      33   1      12   26
df<-df[1:8,c(3,4,6)]
df
##   energy protein calcium
## 1     11      29       1
## 2      8      30       1
## 3     18      21       1
## 4     12      27       1
## 5      6      31       2
## 6      8      29       1
## 7      5      36       2
## 8      5      37       2
# The data set contains some errors 
df[3,1]<-13 # Error in line 3
df[6,1]<-4 # Error at line 6
df[7,3]<-1 # Error at line 7
df
##   energy protein calcium
## 1     11      29       1
## 2      8      30       1
## 3     13      21       1
## 4     12      27       1
## 5      6      31       2
## 6      4      29       1
## 7      5      36       1
## 8      5      37       2
rownames(df)<-c("BB","HR","BR","BS","BC","CB","CC","BH")
df
##    energy protein calcium
## BB     11      29       1
## HR      8      30       1
## BR     13      21       1
## BS     12      27       1
## BC      6      31       2
## CB      4      29       1
## CC      5      36       1
## BH      5      37       2
colnames(df)<-c("Energy","Protein","Calcium")
df
##    Energy Protein Calcium
## BB     11      29       1
## HR      8      30       1
## BR     13      21       1
## BS     12      27       1
## BC      6      31       2
## CB      4      29       1
## CC      5      36       1
## BH      5      37       2
km.res<-kmeans(df[1:8,],3,iter.max = 100)
km.res$cluster
## BB HR BR BS BC CB CC BH 
##  3  3  2  3  1  1  1  1
km.res$centers
##     Energy  Protein Calcium
## 1  5.00000 33.25000     1.5
## 2 13.00000 21.00000     1.0
## 3 10.33333 28.66667     1.0
km.res$totss
## [1] 267.5
sum((df[1:8,]$Energy-mean(df[1:8,]$Energy))^2)+
sum((df[1:8,]$Protein-mean(df[1:8,]$Protein))^2)+
sum((df[1:8,]$Calcium-mean(df[1:8,]$Calcium))^2)
## [1] 267.5
7*var(df[1:8,]$Energy)+7*var(df[1:8,]$Protein)+7*var(df[1:8,]$Calcium)
## [1] 267.5
km.res$withinss
## [1] 47.75000  0.00000 13.33333
km.res$tot.withinss
## [1] 61.08333
km.res$betweenss
## [1] 206.4167
km.res$size
## [1] 4 1 3
km.res$iter
## [1] 1

Exercice 14

  • Produce a heatmap for the data by rearranging the lines according to the found clusters (k=3).

Exercice 15

  • Produce the Table 4.4 of Hartigan (1975).

Exercice 16

  • Use ggplot2 to draw an x-y plot with the number of clusters on the x-axis and the error on the y-axis.

Remark: The location of a knee (elbow) in this plot is generally taken as an appropriate number of clusters.

Visualizing k-means results

  • The results are visualized using fviz_cluster() function.
  • It draws a scatter plot of data points colored by cluster numbers.
  • If the data contains more than 2 variables, the Principal Component Analysis (PCA) algorithm is used to reduce the dimensionality of the data.
  • The first two principal dimensions are used to plot the data.
library(ggplot2)
library(factoextra)

fviz_cluster(km.res, df, ellipse.type = "norm")
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

fviz_cluster(km.res, data = df[1:8,],
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
ellipse.type = "euclid", # Concentration ellipse
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal()
)
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

pca=prcomp(df[1:8,], scale = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3
## Standard deviation     1.4655 0.7938 0.47119
## Proportion of Variance 0.7159 0.2100 0.07401
## Cumulative Proportion  0.7159 0.9260 1.00000

More on PCA

data(decathlon2)
decathlon.active <- decathlon2[1:23, 1:10]
res.pca <- prcomp(decathlon.active, scale = TRUE)
summary(res.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0308 1.3559 1.1132 0.90523 0.83759 0.65029 0.55007
## Proportion of Variance 0.4124 0.1839 0.1239 0.08194 0.07016 0.04229 0.03026
## Cumulative Proportion  0.4124 0.5963 0.7202 0.80213 0.87229 0.91458 0.94483
##                            PC8     PC9   PC10
## Standard deviation     0.52390 0.39398 0.3492
## Proportion of Variance 0.02745 0.01552 0.0122
## Cumulative Proportion  0.97228 0.98780 1.0000
fviz_pca_biplot(res.pca)

Silhouette

  • Assume the data are clustered into \(k\) clusters.

  • For \(i=1,\cdots,n\), let: \[ a_i=\frac{1}{n_{l(i)}-1}\sum_{i^\prime \in C_{l(i)}\setminus\{i\}}d(i,i^\prime), \] be the mean distance between \(i\) and all the other points of the same cluster.

  • If \(n_{l(i)}=1\), we set \(a_i=0\).

  • It is interpreted as a measure of how well \(i\) is assigned to its cluster (the smaller the value, the better the assignment).

  • Let also \[ b_i=\min_{l\neq l(i)}{\frac {1}{n_{l}}}\sum_{i^\prime\in C_l}d(i,i^\prime), \] be the mean distance from \(i\) to the closest other cluster, which is called the “neighboring cluster” of \(i\).

  • We now define a silhouette of \(i\) as \[ s_i=\frac {b_i-a_i}{\max(a_i,b_i)} \text{ if }n_{l(i)}>1, \] and \[ s_i=0\text{ if }n_{l(i)}=1. \]

  • Alternatively, we have \[ s_i= \begin{cases} 1-a_i/b_i,&\:\text{if } a_i<b_i;\\ 0,&\:\text{if } a_i=b_i;\\ b_i/a_i-1,&\:\text{if } a_i>b_i. \end{cases} \]

  • From the above definition it is clear that \[ -1\leq s_i\leq 1\]

  • For \(s_i\) to be close to 1, we require \(a_i\ll b_i\).

  • If \(s_i\) is close to \(-1\), \(i\) is more appropriately clustered in its neighbouring cluster.

  • An \(s_i\) near zero means that \(i\) lies on the border of two natural clusters.
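The definitions above translate directly into code. A Python sketch on made-up toy data (the course itself uses cluster::silhouette in R):

```python
from math import dist

def silhouettes(points, assign):
    """Per-point silhouette s_i computed from the definitions above."""
    out = []
    for i, p in enumerate(points):
        own = [j for j, a in enumerate(assign) if a == assign[i] and j != i]
        if not own:
            out.append(0.0)   # singleton cluster: s_i = 0 by convention
            continue
        a_i = sum(dist(p, points[j]) for j in own) / len(own)
        b_i = min(
            sum(dist(p, points[j]) for j, a in enumerate(assign) if a == l)
            / assign.count(l)
            for l in set(assign) if l != assign[i])
        out.append((b_i - a_i) / max(a_i, b_i))
    return out

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print([round(s, 3) for s in silhouettes(pts, [0, 0, 1, 1])])
```

With two tight, well-separated pairs, every point has \(a_i\ll b_i\) and all four silhouettes are close to 1 (about 0.9 here).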

Silhouette example

library(knitr)
knitr::opts_chunk$set(echo = TRUE)
km.res<-kmeans(df[1:8,],3,iter.max = 100)
slobj<-silhouette(km.res$cluster,dist(df[1:8,]))
row.names(slobj)<-row.names(df[1:8,])

knitr::kable(slobj[,], caption = "Silhouettes values for the food example",digits = 4,col.names = c("Cluster","Neighbor","Silhouette width"),align=c("c","c","c"),booktabs = TRUE)%>%
kable_classic(latex_options = "hold_position")%>%
row_spec(1:2,background = "#2EFEF7")%>%
row_spec(3:5,background = "#F3F781")%>%
row_spec(6:8,background = "#FA58F4")
Silhouettes values for the food example
Cluster Neighbor Silhouette width
BB 2 1 -0.0053
HR 1 2 0.4659
BR 2 1 0.3785
BS 2 1 0.3921
BC 1 3 0.5168
CB 1 3 0.5312
CC 3 1 0.7764
BH 3 1 0.8062

Silhouette graph

silhouette(km.res$cluster,dist(df[1:8,]))%>%
fviz_silhouette()
##   cluster size ave.sil.width
## 1       1    3          0.50
## 2       2    3          0.26
## 3       3    2          0.79

Exercice 17

  • The author obtains the following two figures.

  • Your task is to produce the same figures using modern dataviz tools.
  • What did you learn from these graphs?

Exercice 18

  • Provide a silhouette graph for the Hartigan (1975) data.

Exercice 19

  • Provide a k-means clustering analysis of the USArrests data.

Partitioning Around Medoids (PAM, k-medoids)

  • The algorithm used in the program PAM searches for \(k\) representative objects among the \(n\) objects.
  • These objects should represent various aspects of the structure of the data.
  • In the PAM algorithm, the representative objects are called medoids.
  • After finding a set of \(k\) representative objects, the \(k\) clusters are constructed by assigning each object of the data set to the nearest representative object.
  • To illustrate the algorithm, consider a data set of ten objects (\(n = 10\)) and two variables (\(m = 2\)).

From Kaufman & Rousseeuw, p. 68-72

  • Suppose the data set must be divided into two clusters (\(k = 2\)).
  • The algorithm first considers possible choices of two representative objects.
  • Then, it constructs the clusters around these representative objects.

  • As an example, suppose objects 1 and 5 are the selected representative objects.
  • In Table 2, the Euclidean distance from each object to the two selected objects is given, as well as the smaller of these two dissimilarities and the corresponding representative object.
  • The average dissimilarity is 9.37.

  • In Table 3, the assignment is carried out for the case where objects 4 and 8 are selected as representative objects.

  • The average dissimilarity when objects 4 and 8 are selected is 2.30, considerably less than the value of 9.37 found when 1 and 5 were the representative objects.
  • Alternatively, the PAM program can be run by entering a matrix of dissimilarities between the objects.
  • The algorithms are rather sophisticated and are not detailed here.
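On a data set this small, however, the representative objects can be found by brute force. The following Python sketch (an exhaustive search, not the actual BUILD/SWAP algorithm of PAM) reproduces the averages 9.37 and 2.30 and finds the best pair of representative objects:

```python
from itertools import combinations
from math import dist

# the ten objects of the Kaufman & Rousseeuw example (coordinates x, y)
pts = [(1, 4), (5, 1), (5, 2), (5, 4), (10, 4),
       (25, 4), (25, 6), (25, 7), (25, 8), (29, 7)]

def avg_min_dist(rep_idx):
    """Mean distance from every object to its nearest representative."""
    return sum(min(dist(p, pts[r]) for r in rep_idx) for p in pts) / len(pts)

print(round(avg_min_dist((0, 4)), 2))   # objects 1 and 5 -> 9.37 (Table 2)
print(round(avg_min_dist((3, 7)), 2))   # objects 4 and 8 -> 2.3  (Table 3)
best = min(combinations(range(len(pts)), 2), key=avg_min_dist)
print([i + 1 for i in best])            # [3, 8]
```

The exhaustive search over all 45 pairs confirms that objects 3 and 8 minimize the average dissimilarity, in agreement with the medoids returned by pam() below.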
x=c(1,5,5,5,10,25,25,25,25,29)
y=c(4,1,2,4,4,4,6,7,8,7)
df<-data.frame(x,y)

ggplot(df,aes(x=x,y=y,label=row.names(df)))+geom_point()+geom_text(label=row.names(df),nudge_x = 0.25, nudge_y = 0.25, 
    check_overlap = T)

dist(df,upper=T,diag=T)%>%
as.matrix()->M
cbind(M[,c(1,5)],apply(M[,c(1,5)],1,min))%>%
data.frame()->T2
T2$val=0
T2$val[which(T2[,1]==apply(M[,c(1,5)],1,min))]<-1
T2$val[which(T2[,2]==apply(M[,c(1,5)],1,min))]<-5
names(T2)<-c("Dist. wrt 1","Dist. wrt 5","Min. dist.","Closest rep. obj.")


kable(T2,digit =2,align = "lccc",booktabs = TRUE)%>%
kable_classic(latex_options = "hold_position")
Dist. wrt 1 Dist. wrt 5 Min. dist. Closest rep. obj.
0.00 9.00 0.00 1
5.00 5.83 5.00 1
4.47 5.39 4.47 1
4.00 5.00 4.00 1
9.00 0.00 0.00 5
24.00 15.00 15.00 5
24.08 15.13 15.13 5
24.19 15.30 15.30 5
24.33 15.52 15.52 5
28.16 19.24 19.24 5
mean(T2$`Min. dist.`)
## [1] 9.36615
dist(df,upper=T,diag=T)%>%
as.matrix()->M
cbind(M[,c(4,8)],apply(M[,c(4,8)],1,min))%>%
data.frame()->T3
T3$val=0
T3$val[which(T3[,1]==apply(M[,c(4,8)],1,min))]<-4
T3$val[which(T3[,2]==apply(M[,c(4,8)],1,min))]<-8
names(T3)<-c("Dist. wrt 4","Dist. wrt 8","Min. dist.","Closest rep. obj.")


kable(T3,digit =2,align = "lccc",booktabs = TRUE)%>%
kable_classic(latex_options = "hold_position")
Dist. wrt 4 Dist. wrt 8 Min. dist. Closest rep. obj.
4.00 24.19 4 4
3.00 20.88 3 4
2.00 20.62 2 4
0.00 20.22 0 4
5.00 15.30 5 4
20.00 3.00 3 8
20.10 1.00 1 8
20.22 0.00 0 8
20.40 1.00 1 8
24.19 4.00 4 8
mean(T3$`Min. dist.`)
## [1] 2.3
PAM<-pam(df,2)
PAM$medoids
##       x y
## [1,]  5 2
## [2,] 25 7
PAM$id.med
## [1] 3 8
PAM$clustering
##  [1] 1 1 1 1 1 2 2 2 2 2
PAM$objective
##    build     swap 
## 3.421612 2.185730
PAM$isolation
##  1  2 
## L* L* 
## Levels: no L L*

\(L\) and \(L^\star\) clusters

  • A cluster \(C\) is an \(L\)-cluster if, for every object \(i\in C\), \(\max_{i^\prime\in C}d(i,i^\prime)<\min_{i^\prime\notin C}d(i,i^\prime)\).
  • It is an \(L^\star\)-cluster if its diameter is smaller than its separation: \(\max_{i,i^\prime\in C}d(i,i^\prime)<\min_{i\in C,\:i^\prime\notin C}d(i,i^\prime)\).
  • The output below checks these two conditions for each of the two clusters, which explains the isolation values L* returned by pam() above.

Dist. wrt 3 Dist. wrt 8 Min. dist. Closest rep. obj.
4.47 24.19 4.47 3
1.00 20.88 1.00 3
0.00 20.62 0.00 3
2.00 20.22 2.00 3
5.39 15.30 5.39 3
20.10 3.00 3.00 8
20.40 1.00 1.00 8
20.62 0.00 0.00 8
20.88 1.00 1.00 8
24.52 4.00 4.00 8
## [1] 2.18573
##        1        2        3        4        5 
## 9.000000 5.830952 5.385165 5.000000 9.000000
##        1        2        3        4        5 
## 24.00000 20.22375 20.09975 20.00000 15.00000
##    1    2    3    4    5 
## TRUE TRUE TRUE TRUE TRUE
## [1] 9
## [1] 15
## [1] TRUE
##        6        7        8        9       10 
## 5.000000 4.123106 4.000000 4.123106 5.000000
##        6        7        8        9       10 
## 15.00000 15.13275 15.29706 15.52417 19.23538
##    6    7    8    9   10 
## TRUE TRUE TRUE TRUE TRUE
## [1] 5
## [1] 15
## [1] TRUE

Multidimensional scaling

  • The role of MDS is to construct, from a table of pairwise distances, a map like the one in the following figure, for those who do not know the “geography” of the areas involved.

df<-read.csv("FlyingMileage.csv",sep = ";")
df%>%kable(booktabs = TRUE)%>%
kable_classic(latex_options = "hold_position")
X Atl Chi Den Hou LA Mia NY SF Sea DC
Atl 0 587 1212 701 1936 604 748 2139 2182 543
Chi 587 0 920 940 1745 1188 713 1858 1737 597
Den 1212 920 0 879 831 1726 1631 949 1021 1494
Hou 701 940 879 0 1374 968 1420 1645 1891 1220
LA 1936 1745 831 1374 0 2339 2451 347 959 2300
Mia 604 1188 1726 968 2339 0 1092 2594 2734 923
NY 748 713 1631 1420 2451 1092 0 2571 2408 205
SF 2139 1858 949 1645 347 2594 2571 0 678 2442
Sea 2182 1737 1021 1891 959 2734 2408 678 0 2329
DC 543 597 1494 1220 2300 923 205 2442 2329 0
  • We denote this matrix of distances (or dissimilarities) by \(D\), with components \(d_{i,i^\prime},\:i,i^\prime=1,\cdots,n\).

  • The purpose of the basic MDS method is to build \(n\) points in \(\mathbb{R}^p\), i.e. using \(p\) dimensions.

  • The coordinates of point \(i\) are \((x_i^1,\cdots,x_i^p)\), such that the matrix of distances between the points fits the observed distance matrix \(D\).

  • The MDS distance between points are assumed to be \[\hat{d}_{i,i^\prime}=\sqrt{\sum_{l=1}^p(x_{i^\prime}^l-x_{i}^l)^2}.\]

  • Kruskal’s STRESS formula one is given by

\[ S_1=\sqrt{\frac{\sum_{i,i^\prime}(\hat{d}_{i,i^\prime}-d_{i,i^\prime})^2}{\sum_{i,i^\prime}d_{i,i^\prime}^2}} \]

  • The numerator is the squared error: the sum of squared differences between the observed and the model-derived distances. The denominator is a normalizing factor.
  • We want to find the configuration of points that produces the smallest possible stress.
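Given any candidate configuration, the stress is straightforward to evaluate. A Python sketch with a made-up 3-city example (summing over unordered pairs only, which leaves \(S_1\) unchanged since numerator and denominator scale by the same factor):

```python
from math import dist, sqrt

def stress1(D, X):
    """Kruskal's STRESS-1 between observed distances D[i][j] and the
    distances of the candidate configuration X (list of coordinate tuples)."""
    num = den = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            num += (dist(X[i], X[j]) - D[i][j]) ** 2
            den += D[i][j] ** 2
    return sqrt(num / den)

# three "cities" whose distances are exactly realizable in the plane
D = [[0, 3, 5],
     [3, 0, 4],
     [5, 4, 0]]
perfect = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
print(stress1(D, perfect))                        # 0.0: a perfect map
print(stress1(D, [(0, 0), (1, 0), (1, 1)]) > 0)   # True: a bad map
```

MDS algorithms search over configurations \(X\) to minimize this quantity; zero stress is only attainable when the dissimilarities are exactly realizable in \(p\) dimensions.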

mds<-df%>%select(2:11)%>%
cmdscale(k = 2)%>%
as_tibble()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
colnames(mds) <- c("Dim.1", "Dim.2")
ggplot(mds,aes(x = Dim.1,y=Dim.2))+
geom_point()+geom_text(label=df$X,hjust=0.5, vjust=3, size=2)

# VISUALIZING INCOME INEQUALITY IN EUROPE

The Gini index

Given raw data on income or some form of wealth, i.e. positive values denoted \(x_1,x_2,\cdots,x_n\), where \(n\) is the number of individuals, the Gini index associated with these values, denoted \(G\), is defined by: \[ G=\frac{\sum_{i=1}^n\sum_{j=1}^n|x_i-x_j|/n^2}{2\mu}, \] where \[ \mu=\frac{\sum_{i=1}^n x_i}{n}. \]

  • This index is often used to measure income inequality in a population.

  • Note that when all incomes are doubled, \(G\) remains invariant.

  • If the same amount is added to the income of every individual, the numerator of \(G\) remains invariant while the denominator increases; hence \(G\) decreases.
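Both invariance properties are easy to verify numerically. A minimal Python sketch of the definition, with made-up income values:

```python
def gini(xs):
    """Gini index: mean absolute difference over all ordered pairs,
    divided by twice the mean."""
    n = len(xs)
    mu = sum(xs) / n
    mad = sum(abs(a - b) for a in xs for b in xs) / n ** 2
    return mad / (2 * mu)

incomes = [10, 20, 30, 40]
print(gini(incomes))                          # 0.25
print(gini([2 * x for x in incomes]))         # 0.25: doubling leaves G invariant
print(gini([x + 100 for x in incomes]) < gini(incomes))  # True: G decreases
```

The double loop makes this \(O(n^2)\); for large samples one usually sorts the data first, but the quadratic form mirrors the definition above.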

  • Recall that if \(a\) and \(b\) are two numbers, \(|a-b|=a + b -2\min(a,b)\); this yields an alternative expression of the Gini index:

\[ G=1-\frac{\sum_{i=1}^n\sum_{j=1 }^n \min(x_i,x_j)/n^2}{\mu}. \]

# The project data

We will use data from Eurostat. This site provides data on the distribution of incomes in Europe: it publishes the quartiles, quantiles, deciles, etc. Here is an image of the information it provides.

stat

In the table above, the figure 10,604 corresponds to the first quartile in the EU-28 zone. This means that 25% of the inhabitants of the zone have an income lower than or equal to this amount.

# Importing the data

Libraries used

library(eurostat)
library(tidyverse)
library(lubridate) # lubridate is not part of core tidyverse.
library(countrycode)
library(gganimate)
library(maps)
library(knitr)
library(stargazer)
library(gifski)
library(png)
library(cluster)
library(magrittr)
library(factoextra)

Mean income by country and year

  • We load the data from the eurostat package.
df<-get_eurostat(id="ilc_di03")%>%
filter(unit == "EUR",age == "TOTAL",sex == "T",indic_il == "MEI_E")%>%
mutate(year = year(time))
# MEI_E = Mean equivalised net income
df$name<-countrycode(df$geo,"iso2c","country.name")
df<-df%>%select(year,geo,name,values)%>%
filter(!is.na(name))%>%arrange(name,year)
  • We compute descriptive statistics using the stargazer package.
df%>%select(-geo)%>%spread(name,values)%>%select(-year)%>%as.data.frame()%>%
stargazer(title = "Descriptive statistics for the mean revenue in Europe",digits = 0,type = "latex")
% Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com % Date and time: Thu, May 12, 2022 - 11:43:38 PM

Maps with ggplot2

map <- map_data("world", region = df$name)
map$geo = countrycode(map$region,origin = "country.name", destination = "iso2c")
# Compute the centroids (mean longitude and latitude)
country.lab <- map %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

map%>%ggplot(aes(x = long, y = lat)) +
  geom_polygon(aes( group = group, fill = region))+
  geom_text(aes(label = region), data = country.lab,  size = 3, hjust = 0.5)+
  scale_fill_viridis_d()+
  theme_void()+
  theme(legend.position = "none")

Joining the two files

joined <- left_join(map,df,by = "geo")
# Compute the centroids (mean longitude and latitude)
country.lab <- joined %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))
joined%>%filter(year == 2020)%>% ggplot(aes(x = long, y = lat)) +
  geom_polygon(aes( group = group, fill = values))+
geom_text(aes(label = region), data = country.lab,  size = 3, hjust = 0.5)+
  xlim(-20,45)+ylim(35, 71)+ggtitle("Mean income by country in 2020")

head(joined) %>% filter(year== 2019)
##       long      lat group order  region subregion geo year    name values
## 1 20.06396 42.54727     1     1 Albania      <NA>  AL 2019 Albania   2619
df2019 <- df %>% filter(year==2019) %>% select(values, name) 
# Convert the data frame to a matrix
df2019matrix <- df2019$values %>% as.matrix()
# Ensures the same initial configuration is used each run
set.seed(1)
rownames(df2019matrix) = df2019$name
# Elbow method to choose a suitable number of clusters k
fviz_nbclust(df2019matrix, kmeans, method = "wss", k.max = 5)

# The plot suggests that two clusters are appropriate
# Run the k-means analysis
km <- kmeans(df2019matrix, centers = 2 )
# Retrieve each country's cluster assignment
km$cluster
##         Albania         Austria         Belgium        Bulgaria         Croatia 
##               2               1               1               2               2 
##          Cyprus         Czechia         Denmark         Estonia         Finland 
##               2               2               1               2               1 
##          France         Germany         Hungary         Ireland           Italy 
##               1               1               2               1               2 
##          Latvia       Lithuania      Luxembourg           Malta      Montenegro 
##               2               2               1               2               2 
##     Netherlands North Macedonia          Norway          Poland        Portugal 
##               1               2               1               2               2 
##         Romania          Serbia        Slovakia        Slovenia           Spain 
##               2               2               2               2               2 
##          Sweden     Switzerland          Turkey 
##               1               1               2
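Beyond the membership vector, a `kmeans` fit also exposes the cluster centers, sizes, and the within-cluster sum of squares; a self-contained sketch on simulated data (illustrative values, not the income data):

```r
set.seed(1)                                   # reproducible initial configuration
# Two well-separated groups of 10 points each, in one dimension
toy <- matrix(c(rnorm(10, mean = 0), rnorm(10, mean = 5)), ncol = 1)
km_toy <- kmeans(toy, centers = 2)
km_toy$centers        # one mean per cluster
km_toy$size           # number of points in each cluster
km_toy$tot.withinss   # total within-cluster sum of squares (the "wss" criterion above)
```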
df2019 <- df2019 %>% mutate(cluster = km$cluster)
# Join the previous table to the geographic data
data_cluster <- left_join(df2019, map, by= c("name" = "region"))

# Map the clusters
data_cluster %>% ggplot(aes(x = long, y = lat)) +
  geom_polygon(aes(group = group, fill = cluster)) +
  geom_text(aes(label = region), data = country.lab, size = 3, hjust = 0.5) +
  #scale_fill_viridis_d()+
  theme_void() +
  theme(legend.position = "none") + ggtitle("2 groups of countries in Europe by income")

Animated graphs with ggplot2

dfrank<-df %>% group_by(name) %>%
  summarise(x = mean(values))%>%mutate(x = rank(x))

library(gganimate)  # provides transition_time(); not loaded in the setup chunk
left_join(df, dfrank, by = "name") %>%
  ggplot(aes(x = x, y = values, color = name)) +
  geom_text(aes(label = name), size = 2) +
  theme(legend.position = "none",
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  transition_time(year) +
  labs(title = 'year: {round(frame_time)}', x = '', y = 'Average income')

Income in European countries, animated by year from 1995 to 2020

str(df)
## tibble [592 x 4] (S3: tbl_df/tbl/data.frame)
##  $ year  : num [1:592] 2017 2018 2019 2020 1995 ...
##  $ geo   : chr [1:592] "AL" "AL" "AL" "AL" ...
##  $ name  : chr [1:592] "Albania" "Albania" "Albania" "Albania" ...
##  $ values: num [1:592] 2143 2339 2619 2943 15708 ...
df%>% ggplot(aes(x = name , y = values, label = name , color = name))+
geom_point(stat = 'identity',size = 15)+
geom_segment(aes(y = 100,x = name, yend = values , xend = name ))+
geom_text(color = "black", size = 3)+coord_flip()+
theme(legend.position = 'none')+
labs(title = 'Year: {round(frame_time)}')+
  transition_time(year)+ease_aes('linear')  

Income in European countries, animated by country from 1995 to 2020

df%>%ggplot(aes(x = year ,y = values,group = name,color = name))+geom_line()+transition_reveal(year)

Animated map

mapdyn<-joined %>% group_by(name) %>% 
mutate(long.mean = mean(long),lat.mean = mean(lat))



mapdyn%>%filter(year == 2020)%>%ggplot(aes(x = long , y = lat ))+
  geom_polygon(aes( group = group, fill = name))+
  theme(legend.position = 'none')+
geom_text(aes(x = long.mean,y = lat.mean,label = region,size = values), hjust = 0.5)+
  xlim(-20, 45) + ylim(35, 71) + ggtitle("Average income per country in 2020") +
  geom_point(aes(x = long.mean, y = lat.mean, colour = values, size = values))

# name is not a valid aesthetic and hjust belongs to text geoms, so label the points with geom_text
mapdyn %>% filter(year == 2020) %>% ggplot(aes(x = long.mean, y = lat.mean)) +
  geom_text(aes(label = region, size = values), hjust = 0.5)

mapdyn %>% ggplot(aes(x = long.mean, y = lat.mean)) +
  geom_text(aes(label = region, size = values), hjust = 0.5) +
  transition_reveal(year)

Appendix: Maps with ggplot2

require(maps)
data(us.cities)
big_cities <- subset(us.cities, pop > 500000)
qplot(long,lat,data = big_cities) + borders("state", size = 0.5)

# borders(): a quick and dirty way to get map data
# (from the maps package).
# This is a good place to start, but
# you'll typically want something more sophisticated for communication graphics.
ca_cities <- subset(us.cities, country.etc == "CA")
ggplot(ca_cities, aes(long, lat)) + 
borders(database="county", regions="california", color = "grey70") + geom_point()

# us.cities: a data frame with 6 components,
# "name", "country.etc", "pop", "lat", "long", and "capital":
# the city name, the state abbreviation, population (2006),
# latitude, longitude, and
# capital status (0 for non-capital, 1 for capital, 2 for state capital).
data(world.cities)
capitals <- subset(world.cities, capital == 1)
ggplot(capitals, aes(long, lat)) + 
borders("world", fill="lightblue", col="cornflowerblue") +
geom_point(aes(size = pop),col="darkgreen")

PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction: each data point is projected onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The \(i\)-th principal component can be taken as a direction orthogonal to the first \(i-1\) principal components that maximizes the variance of the projected data.
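As a quick illustration of the projection just described, base R’s `prcomp` can be applied to the built-in `USArrests` data (an illustrative dataset choice, not part of the course material):

```r
# Project the built-in USArrests data on its principal components
pca <- prcomp(USArrests, scale. = TRUE)  # center and scale the four variables
summary(pca)           # proportion of variance explained per component
head(pca$x[, 1:2])     # observations projected on the first two PCs
```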

x <- c(1, 5, 5, 5, 10, 25, 25, 25, 25, 29)
y <- c(4, 1, 2, 4, 4, 4, 6, 7, 8, 7)
df <- data.frame(x, y)  # note: this reuses (overwrites) the name df from above
#choose(30000,3)
PAM <- pam(df, 2)
PAM$medoids
##       x y
## [1,]  5 2
## [2,] 25 7
summary(PAM)
## Medoids:
##      ID  x y
## [1,]  3  5 2
## [2,]  8 25 7
## Clustering vector:
##  [1] 1 1 1 1 1 2 2 2 2 2
## Objective function:
##    build     swap 
## 3.421612 2.185730 
## 
## Numerical information per cluster:
##      size max_diss av_diss diameter separation
## [1,]    5 5.385165 2.57146        9         15
## [2,]    5 4.000000 1.80000        5         15
## 
## Isolated clusters:
##  L-clusters: character(0)
##  L*-clusters: [1] 1 2
## 
## Silhouette plot information:
##    cluster neighbor sil_width
## 3        1        2 0.8491030
## 4        1        2 0.8331846
## 2        1        2 0.8277844
## 1        1        2 0.7748486
## 5        1        2 0.6069286
## 8        2        1 0.8888381
## 7        2        1 0.8863332
## 9        2        1 0.8641158
## 6        2        1 0.8238081
## 10       2        1 0.8215954
## Average silhouette width per cluster:
## [1] 0.7783698 0.8569381
## Average silhouette width of total data set:
## [1] 0.817654
## 
## 45 dissimilarities, summarized :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.123  15.297  13.310  20.616  28.160 
## Metric :  euclidean 
## Number of objects : 10
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
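The resulting partition can also be plotted; a minimal sketch with `factoextra::fviz_cluster` (already loaded above), with the toy data rebuilt so the chunk is self-contained:

```r
library(cluster)     # pam()
library(factoextra)  # fviz_cluster()
x <- c(1, 5, 5, 5, 10, 25, 25, 25, 25, 29)
y <- c(4, 1, 2, 4, 4, 4, 6, 7, 8, 7)
toy <- data.frame(x, y)
# Plot the 2-medoid partition; the medoids act as cluster centers
p <- fviz_cluster(pam(toy, 2), ellipse.type = "convex")
p
```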
df_Rousseeuw <- read.csv("Political_science_data - Sheet1.csv")
row.names(df_Rousseeuw) <- df_Rousseeuw$X
df_Rousseeuw <- df_Rousseeuw %>%
  select(-X) %>%
  as.matrix()
df_Rousseeuw
##      BEL  BRA  CHI  CUB  EGY  FRA  IND  ISR  USA  USS  YUG  ZAI
## BEL 0.00 5.58 7.00 7.08 4.83 2.17 6.42 3.42 2.50 6.08 5.25 4.75
## BRA 5.58 0.00 6.50 7.00 5.08 5.75 5.00 5.50 4.92 6.67 6.83 3.00
## CHI 7.00 6.50 0.00 3.83 8.17 6.67 5.58 6.42 6.25 4.25 4.50 6.08
## CUB 7.08 7.00 3.83 0.00 5.83 6.92 6.00 6.42 7.33 2.67 3.75 6.67
## EGY 4.83 5.08 8.17 5.83 0.00 4.92 4.67 5.00 4.50 6.00 5.75 5.00
## FRA 2.17 5.75 6.67 6.92 4.92 0.00 6.42 3.92 2.25 6.17 5.42 5.58
## IND 6.42 5.00 5.58 6.00 4.67 6.42 0.00 6.17 6.33 6.17 6.08 4.83
## ISR 3.42 5.50 6.42 6.42 5.00 3.92 6.17 0.00 2.75 6.92 5.83 6.17
## USA 2.50 4.92 6.25 7.33 4.50 2.25 6.33 2.75 0.00 6.17 6.67 5.67
## USS 6.08 6.67 4.25 2.67 6.00 6.17 6.17 6.92 6.17 0.00 3.67 6.50
## YUG 5.25 6.83 4.50 3.75 5.75 5.42 6.08 5.83 6.67 3.67 0.00 6.92
## ZAI 4.75 3.00 6.08 6.67 5.00 5.58 4.83 6.17 5.67 6.50 6.92 0.00
# Apply PAM directly on the dissimilarity matrix
PAM_Rousseeuw <- pam(df_Rousseeuw, 2, diss = TRUE)
PAM_Rousseeuw
## Medoids:
##      ID       
## [1,] "9" "USA"
## [2,] "4" "CUB"
## Clustering vector:
## BEL BRA CHI CUB EGY FRA IND ISR USA USS YUG ZAI 
##   1   1   2   2   1   1   2   1   1   2   2   1 
## Objective function:
##    build     swap 
## 3.291667 3.236667 
## 
## Available components:
## [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
## [6] "clusinfo"   "silinfo"    "diss"       "call"
# Note: dist(df_Rousseeuw) computes Euclidean distances between the ROWS of the
# dissimilarity matrix, not the original dissimilarities (those are used directly below)
s <- silhouette(PAM_Rousseeuw$clustering, dist(df_Rousseeuw))
row.names(s) <- row.names(df_Rousseeuw)
fviz_silhouette(s, label = TRUE)
##   cluster size ave.sil.width
## 1       1    7          0.34
## 2       2    5          0.33

sil_Rousseeuw<-silhouette(PAM_Rousseeuw$clustering, df_Rousseeuw)
row.names(sil_Rousseeuw)=row.names(df_Rousseeuw)
sil_Rousseeuw
##     cluster neighbor   sil_width
## BEL       1        2  0.39129752
## BRA       1        2  0.22317708
## CHI       2        1  0.32512211
## CUB       2        1  0.39814815
## EGY       1        2  0.19652641
## FRA       1        2  0.35152954
## IND       2        1 -0.04466159
## ISR       1        2  0.29785894
## USA       1        2  0.42519084
## USS       2        1  0.34104696
## YUG       2        1  0.26177642
## ZAI       1        2  0.18897849
## attr(,"Ordered")
## [1] FALSE
## attr(,"call")
## silhouette.default(x = PAM_Rousseeuw$clustering, dist = df_Rousseeuw)
## attr(,"class")
## [1] "silhouette"
fviz_silhouette(sil_Rousseeuw,label=TRUE, print.summary = TRUE) +
  coord_flip()
##   cluster size ave.sil.width
## 1       1    7          0.30
## 2       2    5          0.26
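The average silhouette width reported above can also guide the choice of k; a sketch looping over candidate values on the toy data from the PAM example (for k = 2 it reproduces the 0.8177 seen in the earlier summary):

```r
library(cluster)  # pam() stores silhouette information in $silinfo
x <- c(1, 5, 5, 5, 10, 25, 25, 25, 25, 29)
y <- c(4, 1, 2, 4, 4, 4, 6, 7, 8, 7)
toy <- data.frame(x, y)
# Average silhouette width for each candidate number of clusters
avg_sil <- sapply(2:5, function(k) pam(toy, k)$silinfo$avg.width)
names(avg_sil) <- 2:5
round(avg_sil, 4)   # the largest value points to the best-separated partition
```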